I'm not sure, but I think that cached read costs are not the most accurately pri...

NitpickLawyer · 2026-02-16T09:25:07 1771233907

While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few miss-conceptions in your post.

> the time it takes to generate the Millionth output token is the same as the first output token.

This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.

> cached input tokens are almost virtually free naturally

No, they are not in practice. There are pure engineering considerations here. How do you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definetly not 0.

Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.

Now consider 100k users doing basically this, all day long. This is not free and can't become free.

TZubiri · 2026-02-16T20:57:45 1771275465

>This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.

I'm not strong on how transformers work, but this is something that is verifiable empirically, and has nothing to do with how transformers work.

Use any LLM through an API. Send 1 input token, and 10k output tokens. Then send 1 input token (different to avoid cache) and ask for 20k output tokens. If the cost and time to compute is exactly twice, then my theory holds.

>No, they are not in practice. There are pure engineering considerations here. How do you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definetly not 0.

I was a bit loose in my definition of "virtually free", here is a more formal statement. The price of GPU compute is orders of magnitude more expensive than the cost of RAM, and the costs of caching inputs are tied to RAM and not GPU. To give an example of the most expensive price component, capital, an H100 costs 25K$, 1GB of RAM costs 10$. Therefore the cost component of cached inputs is negligible.

>Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.

As I said, sure it's not free, but you are talking about negligible costs when compared to the GPU capex. It's interesting to note that the API provider would charge the same no matter if the inference state is cached for 5 minutes, 1ms or 1 hour. So clearly the thing is not optimally priced yet.

If cached inputs from API calls become your primary cost, then it makes sense to move to an API that pays less for cached inputs (if you haven't already done that), then look into APIs where you can control when and when not to cache and for how long to hold it, and finally, into renting GPU and self-hosting an open weights model.

To give a concrete example, suppose we are building a feature where we want to stop upon hitting an ambiguous output token, our technical approach is to generate one output token at a time, check the logprobs, and continue if the prob of the top token is >90%, otherwise, halt. If we generate 1M output tokens with an API, we will pay for roughly 1M^2/2 cached input tokens, while if we self-host, the compute time will be almost identical to that of just generating 1M output tokens. Obviously if we do that with an API it will be almost entirely profit for the API provider, it's just not a use case that has been optimized for, we are in the early days of any type of deeply technical parametrization being done yet, everyone is just either prompting all the way down, or hacking with models directly, doesn't seem like a lot of in between.

mike_hearn · 2026-02-16T10:22:05 1771237325

GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.

TZubiri · 2026-02-16T21:02:24 1771275744

that cost is proportional to how long the cache is held. Currently the cache is not application controlled, it's like CPU caches.

If you hit the cache 1ns after it's been held, you get charged the same as if it's held for 5 minutes or 1 hour.

Also, in terms of LLM APIs, I'm almost certain that the state is offloaded onto RAM and then reloaded onto the GPU memory. If you are renting a GPU, you could keep the inferred state in GPU memory. If you are just holding it for very short periods of time, like my example of generating 1 output token at a time and doing some programmatic logic, then it's currently prohibitively expensive to use an API and you must self-host.

2001zhaozhao · 2026-02-16T09:17:58 1771233478

Caching might be free, but I think making caching cost nothing at the API level is not a great idea either considering that LLM attention is currently more expensive with more tokens in context.

Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong at the scale of the effect here though.

lostmsu · 2026-02-16T20:52:23 1771275143

> To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

This is wrong. Current models still use some full attention layers AFAIK, and their computational cost grows linearly (per token) with the token number.

TZubiri · 2026-02-16T21:09:17 1771276157

I have seen exactly one model that charges more for longer contexts:

https://ai.google.dev/gemini-api/docs/pricing

Gemini 1M context window

That said the cost increase isn't very significant, approximately 2x at the longer end of the context window.

This is in stark contrast with the quadratic phenomenon claimed by the article.

lostmsu · 2026-02-16T22:41:22 1771281682

They just do averaging. Imagine a quadratic pricing structure. Who'd want to deal with it?

TZubiri · 2026-02-17T07:35:56 1771313756

I guess 1.0001 ^2 is quadratic too, but note how it really only charges you 1.5x for more output tokens. Even if cost were quadratic with output length here, we are talking about a very small difference, nothing like the quadratic cost structure proposed by OP:

>Pop quiz: at what point in the context length of a coding agent are cached reads costing you half of the next API call? By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

These are two different cost components, and the one you bring up is minor, OP is talking about a cost that at 1M output tokens, would cause the cost to be 20x per token. You are talking about a cost that at 1M output tokens would cost 1.5x, different things.

The first is an imperfection of the API encapsulation, the latter may be a natural cost phenomenon related to the internals of the state of the state of the art algorithms

lostmsu · 2026-02-17T11:33:21 1771328001

What are you talking about? The cost is quadratic in total conversation length in tokens.

eshaham78 · 2026-02-16T08:31:49 1771230709

[flagged]

2001zhaozhao · 2026-02-16T09:16:02 1771233362

Are you hosting your own infrastructure for coding agents? At least from first glance, sharing actual codebase context across compacts / multiple tasks seems pretty hard to pull off with good cost-benefit unless you have vertical integration from the inference all the way to the coding agent harness.

I'm saying this because the current external LLM providers like OpenAI tend to charge quite a bit for longer-term caching, plus the 0.1x cache read cost multiplied by # LLM calls, so I doubt context sharing would actually be that beneficial considering you won't need all the repeated context every time, so caching context results in longer context for each agentic task which might increase API costs by more overall than you save by caching.