AI Token Inflation Is Not a Bug: It's the Business Model

In most LLM APIs, every call is billed in tokens. Not for results, not for output quality, not for problems solved — for tokens. The equivalent of paying a programmer by keystroke.

When I talk about AI token inflation, I do not only mean an increase in the unit price of a token. The point is subtler: the number of tokens required to get the same useful result can increase over time. Even if nominal prices fall, the real cost of a feature can still grow.

Rupert Goodwins at The Register says it plainly: token billing is not a temporary placeholder waiting for something better. It is a system built to grow — and it will.

AI Token Billing: A Metric That Does Not Measure Value

A token is a computational unit internal to language models — not a measure of productive work. Using it as the billing basis is like measuring code quality by lines produced.

The difference is that nobody built an industry worth trillions on lines of code. With tokens, they did.

Goodwins uses a brutal comparison: billing by token makes even less sense than paying programmers by keystroke. The point is simple: the metric does not measure useful work, results, or quality. It measures consumption.

You cannot keep token consumption reliably efficient if the vendor can change the model’s behavior, response verbosity, system prompts, or reasoning mechanisms from the inside. A more verbose model consumes more. An inefficient implementation consumes more. Both increase the bill. Neither guarantees better output.

Why Token Consumption Grows Without You Noticing

AI providers have a structural incentive to sell more capacity and more reasoning over time. Competition may push unit prices down, but the levers that determine consumption stay in their hands: more verbose models, longer internal reasoning, opaque system prompts, larger context windows, more complex tool calling.

Every “optimization” shipped by a vendor can silently increase consumption while remaining almost invisible from a contractual point of view.

The clearest example is extended reasoning models. OpenAI exposes the count of reasoning tokens in the usage object, for example as output_tokens_details.reasoning_tokens in the Responses API. You can see how many tokens reasoning consumed, but you cannot inspect the full chain that produced them.
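Here is a minimal sketch of what "quantify but not inspect" looks like from the client side. It assumes you already have the usage object as a plain dictionary with the `output_tokens` and `output_tokens_details.reasoning_tokens` fields mentioned above. Reasoning tokens are counted inside the billed output tokens, so the only thing you can do is subtract them to see how much of the bill produced visible text. The token numbers are made up for illustration.

```python
from typing import Any


def split_output_tokens(usage: dict[str, Any]) -> dict[str, int]:
    """Split billed output tokens into hidden reasoning and visible text.

    Assumes a Responses API-style usage dict with `output_tokens` and
    `output_tokens_details.reasoning_tokens`; missing details count as 0.
    """
    output_tokens = int(usage.get("output_tokens", 0))
    details = usage.get("output_tokens_details") or {}
    reasoning = int(details.get("reasoning_tokens", 0))
    return {
        "billed_output": output_tokens,
        "hidden_reasoning": reasoning,
        "visible_output": output_tokens - reasoning,
    }


# Illustrative numbers only, not taken from a real call.
sample_usage = {
    "input_tokens": 1200,
    "output_tokens": 3000,
    "output_tokens_details": {"reasoning_tokens": 2400},
    "total_tokens": 4200,
}
print(split_output_tokens(sample_usage))
# {'billed_output': 3000, 'hidden_reasoning': 2400, 'visible_output': 600}
```

You get a count of the hidden work and nothing else. Whether those 2,400 tokens were necessary is not something the usage object can tell you.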

You are paying for a process you can quantify, but not fully inspect.

Anthropic, with extended thinking, offers more visibility: the API returns thinking blocks in the response, or, on newer models, summaries of them. But thinking tokens are still billed as output tokens. When you receive a summary, you are billed for the full thinking, not the summary. On complex problems, those tokens can significantly increase the cost of a call.

In both cases, the vendor controls how much “thinking” the model consumes by default.

What This Means for Teams Using LLMs in Production

The cost per task can grow even without an application-level mistake. All it takes is a model change, a different reasoning budget, a modified system prompt, or a new default behavior from the provider.

The lock-in that follows is both technical and cultural. Teams grow fluent in one provider’s tooling until every migration becomes a project, not an endpoint swap.

Many ROI metrics also become fragile. They often assume stability in exactly the unit the vendor can change most easily.

Treating token costs as a dynamic variable — never as a fixed input in unit economics — is the only defensible position.

Take a simple example. If a feature costs €0.03 per request today, it is not enough to put that number into a spreadsheet and treat it as stable. Three months later, the same workflow could cost twice as much even if your application code has not changed. All it takes is a more verbose model, longer internal reasoning, or a different default behavior.
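The €0.03 example can be unpacked with simple arithmetic. Cost per request is input tokens times the input price plus output tokens times the output price. The sketch below uses hypothetical prices (€2 and €10 per million tokens) and hypothetical token counts, chosen so today's request costs €0.03. It then adds 3,000 extra output tokens, as longer internal reasoning might, while keeping the prompt and the prices unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in euros (not any vendor's real list price)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request: each token class times its own unit price."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


pricing = Pricing(input_per_million=2.0, output_per_million=10.0)

# Same prompt, same application code, same prices: only the model's output grows.
today = request_cost(input_tokens=2_000, output_tokens=2_600, pricing=pricing)
later = request_cost(input_tokens=2_000, output_tokens=5_600, pricing=pricing)

print(f"today: €{today:.3f}, three months later: €{later:.3f}")  # today: €0.030, three months later: €0.060
print(f"ratio: {later / today:.2f}x")                             # ratio: 2.00x
```

None of the inputs you control changed. The doubling comes entirely from the output side, which is the part the vendor shapes.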

In practice, that means three things.

First: set per-call budget alerts the same way you would alert on any other cloud resource. Do not wait for the end-of-month invoice.
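A per-call alert does not need much machinery. The following is a minimal sketch, assuming you already compute a cost for each call, for example from the usage object and your own price table. `CallBudget` and the €0.045 threshold are hypothetical. In production, the warning would go to whatever metrics or paging system you use for other cloud resources, not just to a log.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("llm_budget")


@dataclass
class CallBudget:
    """Hypothetical per-call budget guard; the threshold is illustrative."""
    max_cost_per_call: float                      # alert if a single call exceeds this (EUR)
    alerts: list[str] = field(default_factory=list)

    def check(self, call_id: str, cost: float) -> bool:
        """Return True if the call stayed within budget; record and log an alert otherwise."""
        if cost <= self.max_cost_per_call:
            return True
        message = (
            f"call {call_id} cost €{cost:.4f}, "
            f"over the €{self.max_cost_per_call:.4f} per-call budget"
        )
        self.alerts.append(message)
        logger.warning(message)
        return False


logging.basicConfig(level=logging.WARNING)
budget = CallBudget(max_cost_per_call=0.045)
budget.check("req-001", 0.030)   # within budget, no alert
budget.check("req-002", 0.060)   # over budget, logs a warning
print(len(budget.alerts))        # 1
```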

Second: benchmark token consumption on representative production traffic samples every week, not only at integration time, when costs can look artificially low.
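A weekly benchmark can be as simple as replaying the same representative prompts and comparing the mean tokens per task against a baseline. The sketch below assumes you have already collected total tokens per task for both runs. The sample numbers and the 20% threshold are hypothetical.

```python
from statistics import mean


def token_drift(baseline: list[int], current: list[int]) -> float:
    """Relative change in mean tokens per task between two runs of the same prompts.

    Returns e.g. 0.25 for a 25% increase. Both lists must be non-empty.
    """
    if not baseline or not current:
        raise ValueError("both samples need at least one measurement")
    base = mean(baseline)
    return (mean(current) - base) / base


# Hypothetical total tokens per task, replaying the same 5 representative prompts.
baseline_week = [4_600, 5_100, 3_900, 4_800, 5_600]
latest_week = [7_900, 9_800, 6_700, 8_400, 10_200]

drift = token_drift(baseline_week, latest_week)
print(f"drift: {drift:+.1%}")
if drift > 0.20:  # illustrative threshold; tune it to your own tolerance
    print("token consumption drifted above threshold; check for model or default changes")
```

Keep the baseline fixed until you deliberately change something on your side. Otherwise, slow drift from the provider disappears into week-over-week noise.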

Third: put a thin abstraction layer over your provider. Not for architectural elegance, but because when you eventually want to migrate, that abstraction may be one of the few things that makes the cost bearable.

LiteLLM, a generic HTTP gateway, or even a simple wrapper function may be enough. The goal is not perfect portability today. It is not being held completely hostage tomorrow.
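The "simple wrapper" option can be as small as one interface that application code depends on, plus one adapter per vendor. This is a sketch under stated assumptions. `ChatProvider`, `MeteredClient` and `EchoProvider` are hypothetical names. `EchoProvider` stands in for a real vendor adapter and fakes token counts by counting words, which real tokenizers do not do. A real adapter would call a vendor SDK or a gateway such as LiteLLM behind the same `complete` method.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class ChatProvider(Protocol):
    """The only surface application code depends on (hypothetical interface)."""

    def complete(self, prompt: str) -> Completion: ...


class EchoProvider:
    """Stand-in provider for illustration; counts words as 'tokens'."""

    def complete(self, prompt: str) -> Completion:
        words = prompt.split()
        return Completion(text=prompt.upper(), input_tokens=len(words), output_tokens=len(words))


class MeteredClient:
    """Wraps any provider and records token usage per call, whichever vendor is behind it."""

    def __init__(self, provider: ChatProvider) -> None:
        self._provider = provider
        self.usage_log: list[tuple[int, int]] = []

    def ask(self, prompt: str) -> str:
        result = self._provider.complete(prompt)
        self.usage_log.append((result.input_tokens, result.output_tokens))
        return result.text


client = MeteredClient(EchoProvider())
print(client.ask("summarize this ticket"))  # SUMMARIZE THIS TICKET
print(client.usage_log)                      # [(3, 3)]
```

With this layout, a migration means writing one new adapter class rather than touching every call site. Usage metering also stays in one place for the alerts and benchmarks above.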

But the point is not only to optimize prompts, alerts, and wrappers. Those are operational bandages. The real problem is that the unit of measurement for the AI market was not chosen by those buying value, but by those selling consumption.

TechMonk’s Take: Three Scenarios for the AI Market

The problem is not only that token billing is unfair. The problem is that nobody knows which of the following three scenarios will materialize.

In all three, you still pay.

Scenario 1 — Price Competition Lowers Prices, but Only for Now

Google cuts Gemini prices, OpenAI follows, Anthropic adjusts. Tokens deflate — not because of a structural change, but because the biggest providers use price as a weapon to capture developer market share.

This is a familiar cloud dynamic: aggressive pricing during the acquisition phase, followed by margin optimization once workloads, data, and skills are already inside the platform.

This deflation is a truce, not a stable equilibrium. It lasts as long as someone still needs to win market share. Once the market consolidates around three or four dominant providers, downward pressure may weaken. And that story has already been told.

The point is not that prices will certainly rise tomorrow. The point is that the unit price of a token does not, by itself, describe the real cost of the system. If the same task requires more tokens to complete, the effective cost can increase even during a phase of nominally lower prices.
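The arithmetic is short. Treat the cost of a task $C$ as a blended price per token $p$ times the number of tokens $T$ the task needs. This simplification ignores separate input and output rates. The numbers are illustrative: a 30% price cut combined with a doubling of tokens per task still leaves the task 40% more expensive.

```latex
C = p \cdot T
\qquad
\frac{C_{\text{new}}}{C_{\text{old}}}
= \frac{p_{\text{new}}}{p_{\text{old}}} \cdot \frac{T_{\text{new}}}{T_{\text{old}}}
= 0.7 \times 2 = 1.4
```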

Scenario 2 — Cognitive Lock-In Is the Real Cost

Goodwins talks about technical lock-in. But there is a more insidious kind that never shows up on any invoice: cognitive lock-in.

Teams redesign workflows around a specific model’s output: response tone, structured format, code snippet length, error tolerance, and the way the model interprets ambiguous instructions.

When the vendor changes model or snapshot — from one GPT family to a reasoning model, for example — those workflows can break.

The cost does not show up in the API bill. It is days of prompt re-engineering, regression testing, output format reviews, qualitative evaluation, and product changes. It does not appear in any unit economics metric, and it is not in any contract.

That is why migrating providers rarely costs a sprint or two. It can cost months, exactly like leaving a legacy architecture nobody wanted to touch anymore.

Scenario 3 — This Needs Governance, Not Just Competition

Today, there are no widely adopted open standards for measuring the actual value produced by models. The people who would benefit most from them — developers and companies paying for APIs — do not have enough market power.

The people who do have market power, meanwhile, have little incentive to give up control of the metric.

A minimum standard could be almost banal: cost per successful task, hidden tokens separated from visible tokens, average consumption drift between versions of the same model, mandatory changelogs when an upgrade changes the spending profile.
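Until vendors publish such figures, two of them can be approximated from your own logs. This is a minimal sketch, assuming you record per task the total billed cost, the visible and hidden output tokens (hidden counts are only available where the provider exposes them, as OpenAI does for reasoning tokens), and whether the task passed your own acceptance check. `TaskRecord` and the batch values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    cost: float            # total billed cost of all calls for this task (EUR)
    visible_tokens: int    # output tokens the user actually received
    hidden_tokens: int     # reasoning/thinking tokens billed but not shown
    succeeded: bool        # did the task pass your own acceptance check?


def cost_per_successful_task(records: list[TaskRecord]) -> float:
    """Total spend divided by tasks that succeeded; failed attempts still count as spend."""
    successes = sum(1 for r in records if r.succeeded)
    if successes == 0:
        return float("inf")
    return sum(r.cost for r in records) / successes


def hidden_token_share(records: list[TaskRecord]) -> float:
    """Fraction of billed output tokens that were never visible."""
    hidden = sum(r.hidden_tokens for r in records)
    total = hidden + sum(r.visible_tokens for r in records)
    return hidden / total if total else 0.0


# Hypothetical batch of four tasks.
batch = [
    TaskRecord(cost=0.04, visible_tokens=500, hidden_tokens=1_500, succeeded=True),
    TaskRecord(cost=0.06, visible_tokens=700, hidden_tokens=2_300, succeeded=False),
    TaskRecord(cost=0.03, visible_tokens=400, hidden_tokens=1_100, succeeded=True),
    TaskRecord(cost=0.05, visible_tokens=600, hidden_tokens=1_900, succeeded=True),
]
print(f"cost per successful task: €{cost_per_successful_task(batch):.3f}")
print(f"hidden token share: {hidden_token_share(batch):.0%}")
```

The failed task is the point of the first metric. Its cost is real, but per-request averages spread it thinly, while cost per successful task charges it to the results you actually got.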

This is not just a market failure in the classical sense. It is a governance problem.

The parallel with regulated markets is not perfect, but the logic is similar: when buyers cannot properly measure what they are paying for, competition alone does not always correct the asymmetry.

The market works extremely well for vendors. Much less so for those trying to understand whether they are paying for value, inefficiency, or lock-in.

The way out of this system is unlikely to come only from competition between providers. It may require external standards, independent benchmarks, greater transparency around effective costs, and metrics closer to value produced than to raw consumption.

Conclusion

The question is not whether token consumption can become an economic lever. It already is.

But inflation should not be read only as an increase in unit price. It should be read as an increase in the cost required to obtain the same useful result.

The more interesting question is who will have enough market power to impose alternative measurement standards: hyperscalers, large enterprises, governments, platforms with enormous volume.

Today, that list is very short.

Developers are not on it.