Topic

Inference Economics

The cost, latency, quality, and verification structure of running AI systems after training.

Why It Matters Here

Inference economics determines whether capability can be used repeatedly, reliably, and affordably.

Related Topics

Verification Economics Verifier Transfer Audit AI Systems Engineering

Programs

Inference Economics, Distributed Runtimes

Linked Artifacts

Paper / May 16, 2026

Disaggregated or Colocated? The Cost-Frontier of LLM Serving Under SLO Contracts.

Research Paper #1 in the AI systems engineering wedge. A closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax. Recasts published throughput figures from five 2023–2025 systems papers (PagedAttention, Sarathi-Serve, DistServe, Splitwise, Mooncake) in a common frame, plots the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts, and solves for the break-even surface between colocated and disaggregated architectures. The frontier partitions cleanly.
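As a rough illustration of the three-term decomposition named above, the sketch below adds a prefill cost, a decode cost, and a KV-transfer cost, then divides by the number of tokens served within the SLO. The additive form, the `ServingCost` fields, the `slo_attainment` fraction, and the function name are all assumptions made for illustration; the paper's actual closed form is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ServingCost:
    """Hypothetical per-window cost terms, all in the same currency unit."""
    prefill: float       # cost attributed to prompt processing
    decode: float        # cost attributed to token generation
    kv_transfer: float   # cost of moving KV cache between pools (assumed 0 if colocated)


def cost_per_compliant_token(cost: ServingCost,
                             tokens_served: int,
                             slo_attainment: float) -> float:
    """Total cost divided by tokens that met the latency contract.

    slo_attainment is the assumed fraction of served tokens that met
    both the TTFT and TPOT targets, in the range (0, 1].
    """
    compliant_tokens = tokens_served * slo_attainment
    if compliant_tokens <= 0:
        raise ValueError("no SLO-compliant tokens were served")
    total = cost.prefill + cost.decode + cost.kv_transfer
    return total / compliant_tokens


# Illustrative numbers only, not measurements from any system.
colocated = ServingCost(prefill=1.0, decode=2.0, kv_transfer=0.0)
disaggregated = ServingCost(prefill=1.0, decode=2.0, kv_transfer=0.5)
```

Under these assumptions, a disaggregated deployment is cheaper per compliant token only when its higher SLO attainment outweighs the added transfer term.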

Paper / May 15, 2026

The Inference-Time Compute Frontier. A Cost-Correct Threshold for Training Versus Test-Time Allocation.

Research Paper #2 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when the marginal compute dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel. Calibrated against rStar-Math, DeepSeek-R1, and test-time-compute curves; matches the observed frontier-vs-commodity market split.

Paper / May 15, 2026

The Routing Premium. An Economic Threshold for Difficulty-Conditional Inference Compute.

Research Paper #3 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when conditioning inference compute on a noisy difficulty estimate reduces cost-per-correct-answer: routing pays iff κ·Δ > γ, where κ is classifier calibration, Δ is workload heterogeneity, and γ is classifier overhead. Unifies five published patterns (speculative decoding, cascades, adaptive self-consistency, complexity-aware exploration, early exit) as one allocation rule, and calibrates against six deployed systems with every operating point on the positive side of the threshold.
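The threshold stated above, routing pays iff κ·Δ > γ, can be written as a one-line check. This is a minimal sketch: the function names are hypothetical, and it assumes κ, Δ, and γ are already expressed in compatible units so that the product κ·Δ is directly comparable to γ. How each quantity is measured is left to the paper.

```python
def routing_margin(kappa: float, delta: float, gamma: float) -> float:
    """Return kappa * delta - gamma.

    kappa: classifier calibration
    delta: workload heterogeneity
    gamma: classifier overhead
    A positive margin means the operating point is on the side of the
    threshold where routing pays.
    """
    return kappa * delta - gamma


def routing_pays(kappa: float, delta: float, gamma: float) -> bool:
    """True when conditioning compute on the difficulty estimate pays."""
    return routing_margin(kappa, delta, gamma) > 0.0


# Illustrative values only, not calibrated against any deployed system.
well_calibrated = routing_pays(kappa=0.8, delta=0.5, gamma=0.1)
poorly_calibrated = routing_pays(kappa=0.1, delta=0.5, gamma=0.1)
```

With these illustrative inputs the first case clears the threshold and the second does not, which shows how a poorly calibrated classifier can erase the gain from a heterogeneous workload.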

Field Note / May 11, 2026

The Heterogeneous-GPU Margin. Coral and the Multi-LLM Procurement Problem.

A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi's Coral: cost-efficient multi-LLM serving over heterogeneous cloud GPUs. Why the fragmentation of the LLM market and the heterogeneity of GPU supply make joint allocation the binding cost lever.

Field Note / May 16, 2026

Harvesting Serving Slack. ROSE and the Collapsed Train-Serve Boundary.

A daily field note on Gao, Zhao, Muhtar et al.'s ROSE: cooperative elasticity for agentic RL rollouts on idle serving GPUs. Why the rollout-cost term in Cost-correct can be priced at the marginal-of-idle rate, and what that does to the inference-frontier threshold.

Field Note / May 17, 2026

The Power-Cap Illusion. SM Clock Locking and the Real Decode Lever.

A daily field note on Ma, Afzal, Eitzinger, and Wellein. Power capping does not bite in memory-bound LLM decode on NVIDIA H200. SM clock locking recovers up to 32% of decode energy. Why the standard energy lever moves the wrong knob, and what that does to the decode-cost term in Cost-correct.

Field Note / May 10, 2026

Linked Artifacts

The Verifier as Curriculum. VHG and the Third Role.

The Alpha Asymmetry. Why Verifiers Can Be Smaller Than Generators.

The Cost of Being Right. Verification Economics in 2026.

The Inference Stack in 2026.

AWQ Quantization

Speculative Decoding

Mamba And State-Space Models

Edge AI Silicon

Verification Economics
