The Inference Stack in 2026
A Field Note on Token Economics, Runtime Systems, and Model Architecture
Manu Bhardwaj. ifitsmanu.com. 3 May 2026. Last updated 3 May 2026. Version 3.0.
v3.0 update. Introduces Verified Capability per Dollar (VCpD) as the operational unit of inference economics, with a multiplicative decomposition into four efficiency factors (quantization, runtime, decoding-time parallelism, hardware) calibrated against the 2023–2026 literature. Shows analytically that the Stanford 280-fold compression at fixed quality decomposes into a stack-improvement factor and roughly 3x from model-architecture progress. Also explains why GPT-5.5 raised prices in April 2026 without contradicting the long-run trend: at the highest GPQA-Diamond bin, the model-architecture term dominates the cost decomposition. Full derivation, definitions, propositions, and pseudocode in the PDF.
Sequel. The companion field note The Cost of Being Right. Verification Economics in 2026. (Field Notes #2) develops the “cheaper correct tokens” framing into a formal Cost-correct decomposition with explicit reasoning-multiplier and verification-accept-rate terms, applies the framework to OpenAI’s April 2026 GPT-5.5 reprice, and traces verification economics through the EU AI Act high-risk obligations entering force in August 2026.
TL;DR
Public LLM API prices fell sharply between 2023 and 2026, but not by a single clean scalar. GPT-4 launched in March 2023 at $30/$60 per million input/output tokens; current pricing spans $0.20/$1.25 (nano-class) to $5/$30 (flagship). The compression came from four compounding stack-level changes: weight-only quantization (AWQ, GPTQ, FP8), memory-aware serving runtimes (PagedAttention, continuous batching), speculative decoding, and a hardware market competing on delivered tokens-per-dollar rather than peak TOPS. The operational unit of inference economics is no longer FLOPs or advertised TOPS. It is verified output quality per dollar at a specified latency, context length, and traffic distribution.
Abstract
The economics of large language model deployment changed substantially between 2023 and 2026. The original GPT-4 API launched at $30 per million prompt tokens and $60 per million completion tokens for the 8K model, while current public API prices span a much wider envelope: from $0.20/$1.25 per million input/output tokens for nano-class models to $5/$30 for current flagship models. This note argues that the price decline should not be described as a single clean “GPT-4-equivalent” scalar. It is better understood as the compound result of four stack-level changes: (i) weight-only quantization and mixed-precision kernels, (ii) memory-aware serving systems such as PagedAttention and iteration-level scheduling, (iii) speculative decoding and related decoding-time parallelism, and (iv) a hardware market in which GPUs, hyperscaler ASICs, and inference-specialized accelerators are all competing on delivered tokens per dollar. The practical engineering implication is simple. The unit of inference economics is no longer FLOPs or advertised TOPS. It is verified output quality per dollar at a specified latency, context length, and traffic distribution.
1. Why the headline needed correction
The inference stack is the layered system that determines per-token cost in production LLMs: model architecture, weight precision and quantization scheme, serving runtime (memory management, batching, scheduling), decoding strategy (greedy, sampled, speculative), hardware (GPU, ASIC, edge accelerator), and the eval surface that decides whether the tokens are actually correct. Discussing inference economics without naming which of those layers moved is what produces compressed claims like “inference is now 1000x cheaper.”
The phrase “GPT-4-equivalent inference is now $0.40 per million tokens” is too compressed to be defensible without a benchmark, a token mix, a latency target, and a definition of equivalence. A public API price can be measured. Model equivalence cannot be inferred from price alone.
A more precise claim is the following. Public language-model API prices have compressed sharply since GPT-4 launched in March 2023, but the compression is uneven across model classes. GPT-4 launched at $30/$60 per million prompt/completion tokens for the 8K model, and $60/$120 for GPT-4-32K. GPT-4o mini later launched at $0.15/$0.60 per million input/output tokens, while current flagship pricing is materially higher than mini and nano-class pricing. As of this writing, OpenAI’s public pricing page lists gpt-5.5 at $5.00/$30.00 per million input/output tokens, gpt-5.4-mini at $0.75/$4.50, and gpt-5.4-nano at $0.20/$1.25 for standard short-context use. These are price points, not quality-normalized capability statements.
For the rest of this note, I use a simple blended cost-per-million metric at a 1:1 input-to-output token mix:
\[ \mathrm{CPM}_{1:1} = \frac{P_{\mathrm{input}} + P_{\mathrm{output}}}{2} \]
where \(P_{\mathrm{input}}\) and \(P_{\mathrm{output}}\) are public API prices in USD per million tokens. This is deliberately simple. Real production CPM depends on cache hit rate, batch / flex tier, prompt-to-output ratio, retry behavior, tool calls, latency tier, and the cost of verification.
Table 1. OpenAI public API list prices.
| Model / date | Input ($/M tokens) | Output ($/M tokens) | CPM 1:1 ($/M tokens) |
|---|---|---|---|
| GPT-4 8K, Mar. 2023 | 30.00 | 60.00 | 45.00 |
| GPT-4 32K, Mar. 2023 | 60.00 | 120.00 | 90.00 |
| GPT-4o mini, Jul. 2024 | 0.15 | 0.60 | 0.375 |
| GPT-4.1, Apr. 2025 | 2.00 | 8.00 | 5.00 |
| GPT-5.4 nano, Mar. 2026 | 0.20 | 1.25 | 0.725 |
| GPT-5.4 mini, Mar. 2026 | 0.75 | 4.50 | 2.625 |
| GPT-5.4, Mar. 2026 | 2.50 | 15.00 | 8.75 |
| GPT-5.5, Apr. 2026 | 5.00 | 30.00 | 17.50 |
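The blended figure in the last column is the plain mean of the input and output prices, which corresponds to a 1:1 token mix. As a minimal sketch, the helper below reproduces that column and shows how the number moves under a different mix. The `Price` class, the `blended_cpm` function, and the `input_share` parameter are illustrative names, not part of any API, and the 9:1 mix is an assumed example of a prompt-heavy workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Public list price in USD per million tokens."""
    input: float
    output: float


def blended_cpm(price: Price, input_share: float = 0.5) -> float:
    """Blend input and output prices by the share of tokens that are input."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be in [0, 1]")
    return input_share * price.input + (1.0 - input_share) * price.output


# A subset of the rows from the pricing table.
PRICES = {
    "GPT-4 8K": Price(30.00, 60.00),
    "GPT-4o mini": Price(0.15, 0.60),
    "GPT-5.4 nano": Price(0.20, 1.25),
    "GPT-5.5": Price(5.00, 30.00),
}

for name, p in PRICES.items():
    even = blended_cpm(p)           # 1:1 mix, matches the table column
    heavy = blended_cpm(p, 0.9)     # assumed 9:1 prompt-heavy mix
    print(f"{name:13s} 1:1 = {even:7.3f}   9:1 = {heavy:7.3f}")
```

Because output tokens are priced several times higher than input tokens in every row, a prompt-heavy mix lowers the blended figure noticeably, which is one reason a CPM number without a stated token mix is hard to compare.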
Stanford’s 2025 AI Index anchors the decline concretely: at GPT-3.5 quality (MMLU 64.8), public-API inference cost fell from $20.00 per million tokens in November 2022 to $0.07 per million tokens (Gemini 1.5 Flash 8B) in October 2024, a 280-fold compression at that quality bin over that window. OpenAI separately reports that GPT-4o mini’s cost per token had dropped 99 percent relative to text-davinci-003. The MIT FutureTech Price of Progress analysis (arXiv:2511.23455) decomposes the heterogeneity: in the highest GPQA-Diamond bin, frontier-quality cost falls roughly 31x per year; in the lowest bin, only 1.7x per year. The annual rate therefore differs by a factor of about 18 (31 / 1.7) depending on which bin is sampled, so no single “X-fold cheaper” headline describes the whole market.
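The figures in this paragraph mix a total decline over a window with per-year rates, so it can help to put them on the same footing. The sketch below assumes a constant geometric rate of decline across the November 2022 to October 2024 window (23 months), which is a simplification: the underlying series is a sequence of discrete model releases, not a smooth curve. The function name is illustrative.

```python
def annualized_factor(total_factor: float, months: float) -> float:
    """Convert a total price-decline factor over `months` into a per-year factor,
    assuming a constant geometric rate of decline."""
    return total_factor ** (12.0 / months)


# Stanford AI Index window: November 2022 to October 2024 is 23 months.
index_rate = annualized_factor(280.0, 23.0)

# Per-bin annual rates quoted from the Price of Progress analysis.
fast_bin, slow_bin = 31.0, 1.7
spread = fast_bin / slow_bin

print(f"280x over 23 months is about {index_rate:.1f}x per year")
print(f"fastest bin / slowest bin = {spread:.1f}")
```

Under that assumption the AI Index figure sits between the two bin rates, which is consistent with the point that the headline depends on which quality bin is sampled.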
Two observations make the picture more interesting than the headline. First, the decline is not monotonic at the frontier: GPT-5.5 was released April 23, 2026 at $5.00/$30.00 per million input/output tokens, a 2x increase over GPT-5.4 ($2.50/$15.00) and the first time in three years that an OpenAI flagship raised prices versus its predecessor. Second, dollars per million tokens is the wrong unit on its own: the same $0.20 per million tokens buys very different capability in 2024 versus 2026. The PDF develops the right unit (Verified Capability per Dollar) formally and decomposes it.
2. The market moved from training economics to inference economics
Training is capital-intensive and episodic. Inference is continuous. Once models are deployed into search, coding, agents, voice, document processing, and internal enterprise workflows, the relevant question becomes not “how many FLOPs can I buy?” but “how many correct, low-latency, policy-compliant tokens can I deliver per dollar?”
This is why cost per token has become a more useful operational metric than raw FLOPs. The metric folds together hardware acquisition or rental cost, memory bandwidth, batching efficiency, KV-cache utilization, software kernels, queueing behavior, and energy. It also exposes a frequent measurement error. A team may optimize a model benchmark while ignoring the serving path that dominates the user-visible bill.
Deloitte’s 2026 TMT prediction estimates that inference workloads will account for roughly two-thirds of all AI compute in 2026, up from about one-third in 2023 and half in 2025, and that inference-optimized chips will exceed $50 billion in 2026. McKinsey’s workload model projects that inference will become the dominant AI data-center workload by 2030. The direction is not controversial. The operational center of gravity is shifting from training runs to serving systems.
3. Quantization. Lower memory traffic, not free accuracy.
The first inference lever is quantization. The important production pattern is weight-only quantization: store weights in a lower-bit format while keeping activations in higher precision. In W4A16, weights are represented as 4-bit integers and activations remain 16-bit. In the idealized case, moving from FP16 / BF16 weights to 4-bit weights cuts weight storage by roughly 75 percent. End-to-end memory reduction can be smaller because KV cache, activations, framework overhead, and batching policy still matter.
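The 75 percent figure follows directly from the bit widths, and the gap between the ideal and the practical number can be sketched with simple arithmetic. The example below assumes a hypothetical 7B-parameter model and, for the grouped case, one FP16 scale per 128 weights. Both the group size and the scale width are assumptions chosen for illustration, and zero-points, KV cache, and activations are left out.

```python
def weight_bytes(n_params: float, bits: float,
                 group_size: int = 0, scale_bits: int = 16) -> float:
    """Bytes needed to store the weights, plus one scale per group if group_size > 0."""
    payload = n_params * bits / 8
    scales = (n_params / group_size) * scale_bits / 8 if group_size else 0.0
    return payload + scales


N_PARAMS = 7e9  # hypothetical model size

fp16 = weight_bytes(N_PARAMS, 16)
w4_ideal = weight_bytes(N_PARAMS, 4)
w4_g128 = weight_bytes(N_PARAMS, 4, group_size=128)  # assumed group size

for label, size in [("FP16", fp16), ("W4 ideal", w4_ideal), ("W4 g128", w4_g128)]:
    saving = 1.0 - size / fp16
    print(f"{label:9s} {size / 1e9:6.2f} GB  saving vs FP16: {saving:6.1%}")
```

The per-group scales cost a little under one percentage point of the saving in this configuration; the larger practical gap usually comes from the memory that quantizing weights does not touch.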
AWQ, or Activation-aware Weight Quantization, observes that only a small fraction of weights are especially sensitive. The AWQ paper reports that protecting about 1 percent of salient weights can substantially reduce quantization error, and it identifies these channels using activation statistics rather than weight magnitude alone. GPTQ takes a different route. It performs one-shot post-training quantization with approximate second-order information. The GPTQ paper reports 3-bit and 4-bit quantization of very large GPT-family models with small degradation and end-to-end speedups over FP16 on A100 / A6000-class GPUs.
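For orientation, the sketch below shows the plain round-to-nearest baseline that both methods improve on. It is deliberately not AWQ or GPTQ: there is no activation-aware scaling and no second-order error compensation. It quantizes one group of weights to signed 4-bit integers with a single shared scale, and the weight values are made up for illustration.

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer level and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: list[int], scale: float) -> list[float]:
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weight group; one large-magnitude value sets the scale for all.
group = [0.02, -0.11, 0.33, -0.70, 0.07, 0.18, -0.04, 0.61]

q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print("levels:", q)
print(f"scale = {scale:.4f}, worst-case error = {max_err:.4f}")
```

The largest weight in the group fixes the step size for every other weight, so small weights absorb the most relative error. That is the failure mode that salient-channel protection and second-order corrections are designed to reduce.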
Quantized weights only produce production gains when the serving path uses kernels that avoid giving the savings back through dequantization overhead or poor memory layout. This is where Marlin matters. Marlin is a family of mixed-precision kernels designed for batched autoregressive inference. Its core observation is that quantized LLM inference is often memory-bound, so reducing weight movement can approach the theoretical speedup from lower precision if the kernel layout and scheduling are correct.
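The memory-bound argument can be made concrete with a back-of-the-envelope bound. The sketch below assumes single-stream decode in which each generated token requires streaming every weight from memory once, and it ignores KV-cache traffic, compute time, and dequantization overhead. The model size and bandwidth are hypothetical round numbers, not measurements of any device.

```python
def decode_tokens_per_second(n_params: float, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode rate when each step streams all weights once."""
    bytes_per_step = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_step


N_PARAMS = 7e9           # hypothetical model size
BANDWIDTH_GB_S = 1000.0  # hypothetical accelerator memory bandwidth

fp16_rate = decode_tokens_per_second(N_PARAMS, 16, BANDWIDTH_GB_S)
w4_rate = decode_tokens_per_second(N_PARAMS, 4, BANDWIDTH_GB_S)

print(f"FP16 bound: {fp16_rate:6.1f} tok/s")
print(f"W4 bound:   {w4_rate:6.1f} tok/s  ({w4_rate / fp16_rate:.1f}x)")
```

The 4x ratio is the theoretical ceiling for this regime. How close a real kernel gets depends on layout and scheduling, and the ratio shrinks as batch size grows and the workload shifts toward compute-bound.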
The engineering rule is not “always use AWQ.” It is: benchmark W4A16 / AWQ / GPTQ / FP8 / NVFP4 or equivalent formats on the exact hardware, model, batch regime, and quality suite you will serve. On NVIDIA Ampere, Ada, Hopper, and Blackwell, vLLM’s quantization documentation lists AWQ, GPTQ, and Marlin paths among supported formats, which makes a quantized-kernel path a reasonable default candidate to test before shipping a full-precision server.
4. Serving runtime. Memory management and scheduling.
The second lever is the runtime. Autoregressive inference stresses systems in a specific way. Prefill is compute-heavy, decode is often memory-bandwidth-bound, request lengths vary, and the KV cache grows and shrinks dynamically.
PagedAttention addressed a central memory-management problem. In vLLM, KV-cache memory is managed in blocks analogous to virtual memory pages. This reduces fragmentation and permits sharing of key-value blocks across requests. The vLLM paper reports near-zero KV-cache waste and 2 to 4x throughput improvement at the same latency compared with prior serving systems such as FasterTransformer and Orca.
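The block-table idea can be illustrated with a toy allocator. This is a minimal sketch of the paging concept only, not vLLM's implementation: the class and method names are invented for illustration, the block size of 16 is an assumption, and block sharing across requests is omitted.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only at a block boundary."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt or queue")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Reserved but unused slots: at most block_size - 1 per live sequence."""
        return sum(len(self.block_tables[s]) * self.block_size - n
                   for s, n in self.lengths.items())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("req-a")   # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("req-b")   # 5 tokens need 1 block

print("block tables:", cache.block_tables)
print("wasted slots:", cache.wasted_slots(), "free blocks:", len(cache.free_blocks))
```

Because memory is reserved one block at a time, waste is bounded by the unfilled tail of each sequence's last block, instead of by a worst-case maximum length reserved up front for every request.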
Continuous batching, also called iteration-level scheduling, attacks a different bottleneck. Static batching forces the whole batch to wait for the slowest request. Orca instead schedules at the granularity of generation iterations, so completed requests can leave and new requests can enter without waiting for the entire batch to finish. Orca’s OSDI paper reports a 36.9x throughput improvement over FasterTransformer at the same latency on a GPT-3 175B serving setup. That number should be read as a systems-paper result against a specific baseline, not as a universal multiplier. The underlying principle is the important part.
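The scheduling difference can be seen in a small step-counting simulation. This is a sketch under strong simplifying assumptions: every decode iteration costs the same regardless of how many slots are occupied, prefill is ignored, and the request lengths are hypothetical. It counts iterations only and says nothing about absolute throughput on real hardware.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """After every decode iteration, finished requests leave and queued ones join."""
    queue = list(lengths)
    running: list[int] = []
    steps = 0
    while queue or running:
        # Fill any free slots from the queue before the next iteration.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        running = [r - 1 for r in running]       # one decode iteration per slot
        running = [r for r in running if r > 0]  # completed requests free their slot
        steps += 1
    return steps


# Hypothetical output lengths (tokens to generate) for eight requests.
output_lengths = [100, 5, 5, 5, 100, 5, 5, 5]

static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)

print(f"static: {static_steps} iterations, continuous: {continuous_steps} iterations")
```

With one long request per batch, static batching leaves three of four slots idle for most of each batch, while iteration-level scheduling lets the second long request start as soon as a slot frees up. The gain depends entirely on how heterogeneous the request lengths are.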
Speculative decoding adds a decoding-time parallelism lever. A small draft model proposes a short continuation, and the large target model verifies the candidate tokens in one pass. Every draft token that agrees with the target is accepted. At the first disagreement, the system discards the remaining draft tokens and continues from the target model’s own token at that position. The original speculative decoding paper reports 2 to 3x acceleration on T5-XXL with identical outputs, and DeepMind’s speculative sampling paper reports 2 to 2.5x speedup on a 70B Chinchilla model without compromising sample quality.
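The draft-then-verify loop can be sketched for the greedy case. This is a simplified illustration, not the algorithm from either paper: the sampling variants use a rejection-sampling rule to preserve the target distribution, which is omitted here. The two toy models are hypothetical deterministic functions over integer tokens, and the verification loop stands in for what would be a single batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function over a prefix


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real system scores all positions in
    #    one forward pass; this loop only mimics the result of that pass.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # bonus token when all drafts match
    return accepted


def target_model(tokens: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy draft: agrees with the target except after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


def generate(prefix: List[int], n_tokens: int) -> Tuple[List[int], int]:
    """Run speculative rounds until n_tokens are produced; return tokens and round count."""
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft_model, target_model)
        rounds += 1
    return out[len(prefix):len(prefix) + n_tokens], rounds


tokens, rounds = generate([0], n_tokens=12)
print(f"{len(tokens)} tokens in {rounds} target rounds: {tokens}")
```

In the greedy case the output is identical to decoding with the target alone, and the saving comes from needing fewer target rounds than tokens. The acceptance rate of the draft determines how much of the theoretical gain survives.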
These techniques compound, but not linearly. The bottleneck changes as each improvement lands. Quantization reduces weight movement. PagedAttention improves KV-cache packing. Continuous batching lifts occupancy under heterogeneous request lengths. Speculation reduces the number of expensive target-model passes. A serving stack that gets all four right can be dramatically cheaper than a naive PyTorch / Hugging Face loop, but the only honest number is the one measured under the production traffic distribution.
5. Hardware. GPUs remain central, but the inference market is contested.
The hardware story is not “NVIDIA lost inference” or “ASICs replaced GPUs.” The more accurate claim is that inference made specialization economically attractive. Once a workload stabilizes, buyers can optimize for tokens per watt, tokens per dollar, memory locality, networking, and software support.
Deloitte lists inference-optimized chips and accelerators from Meta, Google, Amazon, Intel, AMD, Qualcomm, Groq, SambaNova, Cerebras, Graphcore, and others. This does not remove the need for GPU clusters. It creates a heterogeneous procurement problem. GPUs retain advantages in flexibility, ecosystem maturity, training, post-training, and fast model churn. ASICs and inference-specialty accelerators can be attractive when workloads are predictable, batchable, and large enough to justify integration costs.
For engineers, the decision variable is not peak TOPS. Peak TOPS usually ignores memory bandwidth, interconnect, KV-cache behavior, software support, and the cost of hitting latency SLOs. The correct benchmark, as NVIDIA’s own 2026 framing acknowledges, is tokens per second per dollar at a fixed quality target, context length, concurrency distribution, and latency SLO.
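As a minimal sketch of that comparison, the helper below converts an hourly accelerator cost and a measured sustained throughput into dollars per million tokens. The candidate names, hourly costs, and throughputs are hypothetical placeholders. The comparison is only meaningful when every candidate was measured at the same quality target, context length, concurrency distribution, and latency SLO.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """USD per million generated tokens for one accelerator at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1e6


# Hypothetical candidates, assumed to be measured under identical test conditions.
candidates = {
    "accelerator-a": {"hourly_cost_usd": 4.00, "tokens_per_second": 2500.0},
    "accelerator-b": {"hourly_cost_usd": 2.50, "tokens_per_second": 1200.0},
}

ranked = sorted(candidates, key=lambda name: cost_per_million_tokens(**candidates[name]))

for name in ranked:
    cpm = cost_per_million_tokens(**candidates[name])
    print(f"{name}: ${cpm:.3f} per million tokens")
```

In this made-up example the more expensive accelerator is cheaper per token because its throughput advantage outweighs its hourly premium, which is the kind of result a peak-TOPS comparison would not reveal.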
6. Architecture. Long context pushed the stack beyond pure attention.
Pure Transformer attention was the default production architecture from roughly 2017 through 2024. It still anchors most production today, but long-context serving exposes the cost of KV-cache growth. At 128K-plus contexts, KV cache can dominate memory and limit batch size.
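The growth is easy to quantify for a standard attention stack: the cache holds one key and one value vector per layer, per KV head, per cached position. The sketch below uses a hypothetical dense Transformer configuration with grouped-query attention and an FP16 cache. The layer count, head count, and head dimension are assumed values for illustration, not the configuration of any named model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Keys plus values for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration (assumed values).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

short_ctx = kv_cache_bytes(seq_len=8_192, **config)
long_ctx = kv_cache_bytes(seq_len=131_072, **config)

GIB = 2 ** 30
print(f"8K context:   {short_ctx / GIB:5.1f} GiB per sequence")
print(f"128K context: {long_ctx / GIB:5.1f} GiB per sequence")
```

The cache scales linearly with both context length and batch size, so at long contexts each additional concurrent sequence costs as much memory as a sizeable share of the weights.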
State-space models such as Mamba offer a different scaling profile. The Mamba paper reports linear scaling in sequence length and 5x higher inference throughput than Transformers in its setting. Hybrid architectures combine attention with state-space layers. Attention is retained where global mixing is valuable, while linear-complexity layers carry much of the sequence-processing burden.
Jamba-1.5 is the clean reference case. The Jamba-1.5 paper describes a hybrid Transformer-Mamba mixture-of-experts model with 398B total parameters, 94B active parameters, and an effective 256K-token context. It reports roughly an order-of-magnitude reduction in KV-cache memory at 256K context compared with similarly sized open models. This is the architectural reason long-context inference is no longer only a question of renting more HBM.
The practical conclusion is not that every production model should be Mamba-like. It is that long-context architecture and serving architecture must be designed together. A retrieval-heavy 8K system, a 256K document-analysis system, and a real-time voice agent should not share the same default inference assumptions.
7. Hallucination is also an inference-stack problem.
Hallucination belongs in an inference-stack note because production reliability is part of delivered token quality. A cheap token that is confidently wrong can be more expensive than no token.
OpenAI’s 2025 paper, Why Language Models Hallucinate, argues that hallucinations persist partly because standard training and evaluation procedures reward guessing over calibrated uncertainty. Stanford HAI’s legal-domain work found hallucination rates ranging from 69 percent to 88 percent on specific legal queries for GPT-3.5, Llama 2, and PaLM 2. Those legal numbers should not be generalized to every modern model or every domain, but they show the key pattern. Hallucination rates vary sharply by task, model, data availability, and verification surface.
The production mitigation is not a better prompt alone. It is a system: abstention-aware evaluation, retrieval with source constraints, span-level verification, uncertainty surfacing, domain-specific eval sets, and a human escalation path for high-risk outputs.
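To illustrate what abstention-aware evaluation changes, the sketch below scores two hypothetical systems with a rule that gives one point for a correct answer, zero for abstaining, and a penalty for a wrong answer. The penalty of 1.0 and the outcome counts are assumptions made for the example, and the scoring rule is a generic illustration, not a reproduction of any specific benchmark.

```python
def abstention_aware_score(outcomes: list[str], wrong_penalty: float = 1.0) -> float:
    """Mean score with +1 for correct, 0 for abstain, -wrong_penalty for wrong."""
    points = {"correct": 1.0, "abstain": 0.0, "wrong": -wrong_penalty}
    return sum(points[o] for o in outcomes) / len(outcomes)


def accuracy(outcomes: list[str]) -> float:
    """Plain accuracy: abstentions and wrong answers both count as zero."""
    return outcomes.count("correct") / len(outcomes)


# Hypothetical results on ten questions.
guesser = ["correct"] * 6 + ["wrong"] * 4                      # always answers
calibrated = ["correct"] * 5 + ["abstain"] * 4 + ["wrong"] * 1  # abstains when unsure

for name, outcomes in [("guesser", guesser), ("calibrated", calibrated)]:
    print(f"{name:10s} accuracy = {accuracy(outcomes):.2f}  "
          f"abstention-aware = {abstention_aware_score(outcomes):.2f}")
```

Under plain accuracy the system that always guesses ranks higher. Once wrong answers carry a cost, the ranking flips, which is the incentive shift that abstention-aware evaluation is meant to create.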
8. Engineering implications
- Report CPM with context. A useful CPM number includes model, token mix, cache rate, batch tier, average input/output lengths, SLO, tool-call overhead, and quality gate. A naked price-per-million-tokens number is incomplete.
- Benchmark quantized serving before shipping full precision. W4A16, AWQ, GPTQ, FP8, NVFP4, and related formats should be treated as candidates, not slogans. The best choice depends on model family, hardware, batch size, context length, and eval sensitivity.
- Profile prefill and decode separately. The bottleneck during prefill is not necessarily the bottleneck during decode. Track TTFT, TPOT, queueing delay, KV-cache occupancy, accepted speculative tokens, and tokens per watt.
- Do not design long-context systems as prompt-length extensions only. At 128K-plus contexts, architecture, retrieval, KV-cache layout, prefix caching, and verification become one design problem.
- Treat factuality as part of serving quality. Production inference should measure not only latency and throughput, but also abstention, citation accuracy, retrieval coverage, and verified answer rate.
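Two of the metrics named above, TTFT (time to first token) and TPOT (time per output token), can be derived from per-token arrival timestamps on the client side. The sketch below assumes a streaming response where each token's arrival time is recorded. The function name and the timestamps are illustrative, and TPOT is taken here as the mean gap between tokens after the first, which is one common convention.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """TTFT and mean TPOT in seconds from a request timestamp and token arrival times."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent          # includes queueing and prefill
    decode_tokens = len(token_times) - 1
    tpot = ((token_times[-1] - token_times[0]) / decode_tokens
            if decode_tokens else 0.0)            # decode phase only
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": token_times[-1] - request_sent}


# Hypothetical arrival times (seconds since the request was sent).
arrivals = [0.50, 0.54, 0.58, 0.62, 0.66]
metrics = latency_metrics(0.0, arrivals)

print(f"TTFT = {metrics['ttft_s'] * 1000:.0f} ms, "
      f"TPOT = {metrics['tpot_s'] * 1000:.0f} ms, "
      f"total = {metrics['total_s'] * 1000:.0f} ms")
```

Keeping the two numbers separate matters because TTFT is dominated by queueing and prefill while TPOT reflects the decode loop, so an optimization can improve one and leave the other unchanged.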
9. Conclusion
The inference stack in 2026 is not one breakthrough. It is a compound curve. Public API prices fell because models became smaller and better, quantized serving became practical, kernels improved, KV-cache memory was managed more intelligently, schedulers stopped wasting batches, speculation reduced serial decode cost, and hardware competition moved from peak FLOPs to delivered tokens.
The next engineering regime will be defined less by whether inference becomes cheaper in the abstract and more by how precisely teams can trade off cost, latency, context, reliability, and verification. The systems that win will not simply generate cheaper tokens. They will generate cheaper correct tokens under production constraints.
References
- OpenAI. GPT-4o mini: advancing cost-efficient intelligence. July 18, 2024.
- Stanford Institute for Human-Centered AI. The 2025 AI Index Report. 2025.
- NVIDIA. Rethinking AI TCO. Why Cost per Token Is the Only Metric That Matters. April 15, 2026.
- Lieber, O. et al. Jamba-1.5: Hybrid Transformer-Mamba Models at Scale. arXiv:2408.12570, 2024.
FAQ
How much did public LLM API prices fall between 2023 and 2026?
Per the Stanford 2025 AI Index, inference cost fell more than 280-fold between November 2022 and October 2024 for GPT-3.5-class quality. Headline “1000x” claims conflate model classes. The defensible decline is uneven across nano, mini, and flagship tiers. See Table 1 above for OpenAI public pricing across eight model checkpoints from March 2023 through April 2026.
What is the most useful operational metric for LLM inference economics?
Verified output quality per dollar at a specified latency, context length, and traffic distribution. Naked $/MTok numbers omit cache hit rate, batch tier, prompt/output ratio, retry behavior, tool calls, and the cost of verification. A useful CPM is conditioned on all of these.
What four stack-level changes drove the inference price decline?
(i) Weight-only quantization (AWQ, GPTQ, FP8, NVFP4) and matched mixed-precision kernels (Marlin). (ii) Memory-aware serving runtimes (PagedAttention, continuous batching, iteration-level scheduling). (iii) Speculative decoding and related decoding-time parallelism. (iv) A hardware market in which GPUs, hyperscaler ASICs, and inference-specialized accelerators compete on delivered tokens-per-dollar rather than peak TOPS.
Are GPUs still the right default for inference in 2026?
For most production LLM workloads, yes. GPUs retain advantages in flexibility, ecosystem maturity, training, post-training, and fast model churn. Inference-specialty accelerators (Groq, Cerebras, SambaNova, hyperscaler ASICs from Meta, Google, Amazon) become attractive when workloads are predictable, batchable, and large enough to justify integration costs.
What is the architectural reason long-context inference changed in 2026?
KV-cache memory grows linearly with context length, and at 128K-plus contexts it can dominate memory and limit batch size. Hybrid architectures such as Jamba-1.5 (Transformer + Mamba state-space + Mixture-of-Experts) report roughly an order-of-magnitude reduction in KV-cache memory at 256K context compared with similarly sized open Transformers. Long-context architecture and serving architecture must now be designed together.
Why does hallucination belong in an inference-stack note?
A cheap token that is confidently wrong can be more expensive than no token. Production inference quality depends on latency, throughput, and verified factuality together. The mitigation is a system, not a prompt: abstention-aware evaluation, retrieval with source constraints, span-level verification, uncertainty surfacing, and a human escalation path for high-risk outputs.
Cite this article
@misc{bhardwaj2026inference,
author = {Bhardwaj, Manu},
title = {The Inference Stack in 2026: A Field Note on
Token Economics, Runtime Systems, and Model Architecture},
year = {2026},
month = {May},
url = {https://ifitsmanu.com/papers/the-inference-stack-2026},
note = {Field note. Version 3.0.}
}
Bhardwaj, M. (2026, May). The inference stack in 2026: A field note on token economics, runtime systems, and model architecture. ifitsmanu.com. https://ifitsmanu.com/papers/the-inference-stack-2026
Bhardwaj, Manu. "The Inference Stack in 2026: A Field Note on Token Economics, Runtime Systems, and Model Architecture." ifitsmanu.com, May 2026. https://ifitsmanu.com/papers/the-inference-stack-2026.
M. Bhardwaj, "The Inference Stack in 2026: A Field Note on Token Economics, Runtime Systems, and Model Architecture," ifitsmanu.com, May 2026. [Online]. Available: https://ifitsmanu.com/papers/the-inference-stack-2026