The Hidden Thread in the Token Business: Cost Is Set by KV Cache Hits, Not Throughput

The more I study the token business lately, the more I feel there's one angle that keeps getting overlooked.

Over the past year, when people benchmark inference performance, they mainly watch three numbers: absolute throughput, TTFT (time to first token), and TPOT (time per output token). How many requests can be batched, how fast the first token comes out, how fast each subsequent token follows. That's the standard talking point today.

But when you actually get down to serving, you find that what really drives token cost isn't throughput. It's whether the KV cache hits.

I. A 10× Gap Carved into the Price List

Several model APIs have cut prices recently. Open any pricing sheet and you'll see the input column was split in two a while back: cache hit and cache miss.

How big is the gap?

Anthropic charges 0.1× the base input price for cache reads, making them 10× cheaper, and charges 1.25× (5-minute TTL) or 2× (1-hour TTL) to write a cache entry. DeepSeek V4 Flash lists cache hits at $0.0028 per million tokens and cache misses at $0.14 per million tokens, a 50× difference. On April 26, DeepSeek cut cache hit prices across all models by another factor of 10.

At the machine level, the difference comes down to compute. A hit skips prefill for the cached prefix; the machine only has to process the new tokens and then decode. A miss means recalculating from scratch, burning machine time and compute. The gap isn't a few percentage points. It's multiples: 10× to 50× on the price lists above.

The interesting part is this: once cache hit and miss are priced separately, some of the cost is yours to control through design, and the rest is entirely up to the vendor. With the price split like this, "which API is cheaper" can no longer be answered from per-token list prices. You have to look at actual hit rates.
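To make the hit-rate point concrete, here is a minimal sketch of a blended input price. The two vendors and their prices are hypothetical, invented only for illustration, and cache-write premiums are ignored.

```python
def effective_input_price(miss_price, hit_price, hit_rate):
    """Blended price per million input tokens at a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Hypothetical price sheets, in $ per million input tokens (not real quotes).
# Vendor A has the higher list price but the workload hits its cache 90% of the time.
vendor_a = effective_input_price(miss_price=3.00, hit_price=0.30, hit_rate=0.90)
# Vendor B has the lower list price but the workload only hits 10% of the time.
vendor_b = effective_input_price(miss_price=2.00, hit_price=0.20, hit_rate=0.10)

print(f"vendor A: ${vendor_a:.2f} per million input tokens")
print(f"vendor B: ${vendor_b:.2f} per million input tokens")
```

Under these assumed numbers, the vendor with the lower list price ends up roughly three times as expensive per input token.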

II. The Biggest Pitfall on the User Side: Model Routing

Let's talk about how users can mess this up.

Lately people have bought hard into "model routing." Hard tasks go to strong models, easy ones to weak models. It looks like savings on paper.

My view: switching models mid-session is usually a losing bet.

The clearest example is Claude Code. You've accumulated 300K tokens of context, then midway decide Opus is too expensive for this step and switch to Sonnet. Claude Code now pops a warning that tells you explicitly: after switching, all previous cache is invalidated, and the next step must cold-start and recalculate. It didn't warn before. After enough complaints, they added it.

Why does it invalidate? Each model's KV representation is different, so cache can't be reused across models. Opus cache and Sonnet cache are two different things. The session hasn't changed and the cache key hasn't moved, but the new model still has to recompute everything, so not a penny is saved.

Run the numbers. Current Opus 4.7 is $5/$25 per million tokens (input/output); Sonnet 4.6 is $3/$15. Sonnet is roughly 40% cheaper than Opus, not the 5× gap of the past. But the preceding 300K tokens of input go from a cache hit (0.1× price) to a cold calculation (1× price), a 10× jump at any given list price. On Opus, reading those 300K tokens from cache costs about $0.15 (0.3M × $5 × 0.1). On Sonnet, prefilling them cold costs about $0.90 (0.3M × $3). You save 40% on the model itself, and the input bill for that step still comes out at about 6× what it was.
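The same arithmetic as a small sketch, using only the prices and the 0.1× cache-read multiplier quoted in this article. It looks at the input cost of the single step right after the switch and ignores output tokens and cache-write premiums.

```python
CONTEXT_TOKENS = 300_000
OPUS_INPUT = 5.00        # $ per million input tokens, as quoted above
SONNET_INPUT = 3.00      # $ per million input tokens, as quoted above
CACHE_READ_MULT = 0.1    # cache reads billed at 0.1x the base input price


def input_cost(tokens, price_per_million, multiplier=1.0):
    """Dollar cost of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million * multiplier


# Stay on Opus: the accumulated context is read from cache.
stay_on_opus = input_cost(CONTEXT_TOKENS, OPUS_INPUT, CACHE_READ_MULT)
# Switch to Sonnet: the same context is prefilled cold at full price.
switch_to_sonnet = input_cost(CONTEXT_TOKENS, SONNET_INPUT)

print(f"stay on Opus:     ${stay_on_opus:.2f}")
print(f"switch to Sonnet: ${switch_to_sonnet:.2f}")
print(f"ratio:            {switch_to_sonnet / stay_on_opus:.1f}x")
```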

Plus, agent workloads are almost always input-heavy and output-light. Prefill usually runs at thousands of tokens per second; decode runs at dozens to just over a hundred. That's up to two orders of magnitude. The money is basically spent on input. Model routing destroys exactly the savings mechanism on the input side.

So mainstream agent design today revolves around "context stability." Don't swap models lightly, don't change tool structures, don't touch core prompts like CLAUDE.md halfway through. One move, and hit rates really do drop from 90% to 5%.
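Why does one small change wreck the hit rate? Prefix caches generally match from the start of the prompt, so anything that changes early invalidates everything after it. Below is a toy sketch of that idea using chained block hashes. The block size, the token strings, and the function names are all made up for illustration; this is not any vendor's actual implementation.

```python
import hashlib

BLOCK = 4  # tokens per cache block (tiny, for illustration only)


def block_hashes(tokens, block=BLOCK):
    """Hash each full block together with the hash of everything before it."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        chunk = " ".join(tokens[i:i + block]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(cached, new, block=BLOCK):
    """Number of leading tokens of `new` whose blocks match `cached`."""
    n = 0
    for a, b in zip(block_hashes(cached, block), block_hashes(new, block)):
        if a != b:
            break
        n += block
    return n


system = ["s0", "s1", "s2", "s3"]           # stands in for the core prompt
tools = ["t0", "t1", "t2", "t3"]            # stands in for tool definitions
history = [f"h{i}" for i in range(12)]      # stands in for earlier turns
turn = ["u0", "u1", "u2", "u3"]             # the new user turn

cached = system + tools + history
appended = cached + turn                                      # only append at the end
edited_system = ["s0", "EDIT", "s2", "s3"] + tools + history + turn
changed_tools = system + ["t0", "NEW", "t2", "t3"] + history + turn

print(cached_prefix_tokens(cached, appended))       # all 20 cached tokens reused
print(cached_prefix_tokens(cached, edited_system))  # nothing reused
print(cached_prefix_tokens(cached, changed_tools))  # only the system block reused
```

Appending keeps the whole prefix; touching the core prompt or the tool definitions throws away everything downstream of the edit.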

III. Claude Code's Solution: Spawn Sub-Agents

So what if part of a task really is better suited to a cheaper model?

Claude Code's solution is to spawn sub-agents.

The main session stays on Opus, preserving its hit rate. When you need to explore, batch-process, or run a specific sub-task, you call the Task tool to spin up a new agent. The new agent runs in its own isolated context, can pick a cheaper model, and maintains its own hit rate. When it's done, only a summary is passed back to the main session. The main session's cache isn't touched.

The precondition for this mechanism is that the sub-task's context needs differ enough from the main task's. If your sub-task happens to feed most of the main session's content into it, that's another cold start inside the sub-agent, and prefill eats up whatever you saved. This takes pretty fine-grained judgment.
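A rough way to think about that judgment call, as a sketch under simplifying assumptions of my own: only the first call's input cost is compared, the main session runs at the Opus price with 0.1× cache reads, the sub-agent runs at the Sonnet price, and `shared` is the number of main-session tokens you would copy into the sub-agent. The function names are hypothetical.

```python
def in_session_cost(shared, own, price=5.00, cache_read_mult=0.1):
    """Run the step in the main session: shared context is a cache hit, new tokens are not."""
    return (shared * price * cache_read_mult + own * price) / 1_000_000


def sub_agent_cost(shared, own, price=3.00):
    """Run the step in a fresh sub-agent: everything it is fed is prefilled cold."""
    return (shared + own) * price / 1_000_000


# Sub-task that needs little of the main context: 5K copied, 40K of its own.
light = (5_000, 40_000)
# Sub-task that drags most of the main context along: 200K copied, 40K of its own.
heavy = (200_000, 40_000)

for name, (shared, own) in (("light", light), ("heavy", heavy)):
    a = in_session_cost(shared, own)
    b = sub_agent_cost(shared, own)
    winner = "sub-agent" if b < a else "main session"
    print(f"{name}: main session ${a:.4f}, sub-agent ${b:.4f} -> {winner}")
```

With these assumed prices, the sub-agent only wins on first-call input cost while the copied context stays below about 0.8× the sub-task's own tokens. The sketch leaves out the other benefit of a sub-agent, which is keeping the main session's context from growing.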

IV. Server-Side KV Cache Engineering

How big is the server-side gap? Massive.

The crudest implementations don't design for cache at all. There is no reuse across users. Routing goes haywire: a request that should land on the machine holding its cache lands elsewhere. All cache capacity sits in VRAM, so it is small and entries get evicted quickly. In a system like that, no amount of user-side care can save you.

The mature example is the Mooncake framework from Moonshot AI, Tsinghua University, and Qijing Technology, a KVCache-centric disaggregated architecture. Prefill and decode clusters are separated, and underutilized CPU, DRAM, and SSD resources on GPU nodes are repurposed into a distributed KV cache pool. A KV cache scheduler handles queuing and routing. The paper cites a throughput gain of up to 525% in simulated scenarios; under real loads, requests handled increased by 75% to 115%.
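To give a feel for what a cache-aware scheduler weighs, here is a toy sketch: pick the node with the least prefill work ahead of the request, counting both its queue and the part of the prompt it does not already hold. This is not Mooncake's actual scheduler; the `Node` class, the cost estimate, and the numbers are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    queued_tokens: int = 0                                 # prefill work already waiting
    cached_prefixes: list = field(default_factory=list)    # token lists held in cache


def common_prefix_len(a, b):
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_cached_len(node, prompt):
    """Longest prefix of `prompt` that this node already has cached."""
    return max((common_prefix_len(p, prompt) for p in node.cached_prefixes), default=0)


def pick_node(nodes, prompt):
    """Choose the node with the least estimated prefill work for this request."""
    def work(node):
        return node.queued_tokens + len(prompt) - best_cached_len(node, prompt)
    return min(nodes, key=work)


prompt = list(range(1000))
nodes = [
    Node("a"),                                                          # idle, no cache
    Node("b", queued_tokens=300, cached_prefixes=[list(range(900))]),   # busy-ish, 90% cached
    Node("c", queued_tokens=2000, cached_prefixes=[list(range(1000))]), # fully cached, overloaded
]
print(pick_node(nodes, prompt).name)
```

The idle node and the fully cached node both lose here: one has to prefill everything, the other has too long a queue. A crude router that ignores either factor would pick wrong.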

The counterexample is Openclaw. This open-source agent took a lot of criticism, mostly because it stumbles at this layer. Its plugin architecture doesn't set a promptCacheKey by default. Pass it through a third-party proxy and you lose node affinity, so the cache hit rate is nearly 0%. The total token volume isn't actually that high, but all input is priced at cache-miss rates, so the bill is ridiculous. About a month ago I looked at its trace: 60K+ input tokens in one request, 0% hit rate, $0.12 a pop. That's what happens when the request path isn't designed to reach the server-side cache.
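Here is a toy sketch of why a stable key matters for node affinity. The `route` function and node names are hypothetical, and `cache_key` is only a stand-in for whatever stable identifier a client sends; real routers typically use consistent hashing rather than a plain modulo.

```python
import hashlib
from itertools import count

NODES = ["node-0", "node-1", "node-2", "node-3"]
_round_robin = count()


def route(cache_key=None):
    """Send keyed requests to a fixed node; spread unkeyed requests round-robin."""
    if cache_key is None:
        return NODES[next(_round_robin) % len(NODES)]
    digest = hashlib.sha256(cache_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Eight requests from one session, with and without a stable key.
keyed = {route("session-42") for _ in range(8)}
unkeyed = {route() for _ in range(8)}

print("nodes touched with a key:   ", len(keyed))
print("nodes touched without a key:", len(unkeyed))
```

With a key, every request in the session lands where its cache lives. Without one, the session is smeared across all four nodes and each of them sees a cold prefix.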

V. The Model Layer Can Also Push in This Direction

Go one layer deeper: the model itself can free up massive room for KV cache.

DeepSeek has been the most systematic here. MLA (Multi-head Latent Attention) compresses keys and values into a small latent vector and projects them back up when attention is computed, so only the latent has to be cached, cutting KV cache volume by over 90%. V3 kept this mechanism. Later they added Native Sparse Attention, which almost flattens KV cache growth in long contexts. Only then can the inference system build a cache pool at million-scale context lengths.
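A back-of-the-envelope sketch of where the compression comes from. Standard multi-head attention caches a key and a value vector per head; MLA caches one shared latent plus a small positional key. The dimensions below are illustrative assumptions in the range of published MLA configurations, not figures taken from this article, and the baseline here is plain multi-head attention.

```python
def mha_kv_elems(n_heads, head_dim):
    """Per token, per layer: one key and one value vector for every head."""
    return 2 * n_heads * head_dim


def mla_kv_elems(latent_dim, rope_dim):
    """Per token, per layer: one shared latent vector plus a decoupled positional key."""
    return latent_dim + rope_dim


# Assumed dimensions, for illustration only.
mha = mha_kv_elems(n_heads=128, head_dim=128)
mla = mla_kv_elems(latent_dim=512, rope_dim=64)
reduction = 1 - mla / mha

print(f"MHA elements per token per layer: {mha}")
print(f"MLA elements per token per layer: {mla}")
print(f"reduction: {reduction:.1%}")
```

The exact percentage depends on the baseline and the dimensions you plug in, but the shape of the result is the same: the cached state per token shrinks by well over an order of magnitude, which is what makes a large pooled cache affordable.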

But once the model changes, cache hit determination logic changes too. Some inputs previously recognized as "prefix overlap" behave differently under sparse attention, so whether they hit needs realignment. The inference system has to be revised as well. That's why I say inference engineers can't just stare at throughput anymore. They have to redesign cache around the model architecture.

VI. Measuring This Is Hard

The most frustrating part of the whole chain is that evaluating the server side alone is useless, and evaluating the user side alone is also useless.

No matter how strong the server-side cache is, if the user side is designed like Openclaw, hit rates still won't rise. No matter how careful the user side is, hit a crude server with chaotic routing, insufficient capacity, and no cross-user reuse, and costs still leak.

So when it comes to tokens, "which API is cheaper" can't be compared on a single dimension. On the same hardware, with the same target workload, the total bill can differ by 5× to 10× depending on whether the two ends coordinate well or each does its own thing.
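The one number you can always measure from your own side is the token-weighted hit rate across a whole workload. A minimal sketch, assuming each request log record carries hit and miss token counts; the field names `hit_tokens` and `miss_tokens`, the two example clients, and the prices are hypothetical.

```python
def token_weighted_hit_rate(records):
    """Share of all input tokens that were served from cache."""
    hit = sum(r["hit_tokens"] for r in records)
    miss = sum(r["miss_tokens"] for r in records)
    total = hit + miss
    return hit / total if total else 0.0


def input_bill(records, miss_price, hit_price):
    """Total input cost in dollars; prices are $ per million tokens."""
    hit = sum(r["hit_tokens"] for r in records)
    miss = sum(r["miss_tokens"] for r in records)
    return (hit * hit_price + miss * miss_price) / 1_000_000


# Two clients sending the same volume: ten requests of 60K input tokens each.
careless = [{"hit_tokens": 0, "miss_tokens": 60_000} for _ in range(10)]
careful = [{"hit_tokens": 0, "miss_tokens": 60_000}] + [
    {"hit_tokens": 57_000, "miss_tokens": 3_000} for _ in range(9)
]

for name, log in (("careless", careless), ("careful", careful)):
    rate = token_weighted_hit_rate(log)
    bill = input_bill(log, miss_price=2.00, hit_price=0.20)
    print(f"{name}: hit rate {rate:.1%}, input bill ${bill:.4f}")
```

Weighting by tokens rather than averaging per-request rates matters because one long cold request can outweigh many short warm ones.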

Closing Thoughts

The cache hit/miss split on token pricing tables is the single most important thread in this whole thing. It gives users a clear incentive: high hit rates save you money. At the same time, it pressures providers. If your cache system isn't strong enough, you won't get the business.

I hope vendors also expose the cache hit mechanism itself. Otherwise users only know it exists without knowing how to optimize for it. There still aren't many vendors that can tie together model, server-side cache, and user-side usage end to end.

The edge side is still competing on raw performance. But once app density rises and agents really start running, KV cache will become a core issue there too. From cloud to edge, there's no way around it.