Research Program / active

Inference Economics

Cost, quality, latency, verification, and hardware structure for running AI systems.

Questions

When does cost-per-token stop explaining system economics?
Which levers move cost-per-correct-answer fastest?
How do runtime improvements and verifier improvements compound?
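One way to make the gap between the first two questions concrete is a toy accounting, sketched here under loud assumptions: attempts are independent, a wrong answer is always detected at zero cost, and the system retries until correct, so the expected number of attempts is 1 / accuracy. The function name and parameters are illustrative and are not taken from the linked papers.

```python
def cost_per_correct_answer(price_per_token: float,
                            tokens_per_attempt: float,
                            accuracy: float) -> float:
    """Expected dollars spent per correct answer.

    Assumption (hypothetical model): independent retries until the first
    correct answer, with free and perfect checking, so the expected
    number of attempts is 1 / accuracy.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return price_per_token * tokens_per_attempt / accuracy


# Two hypothetical models answering with 500 tokens per attempt.
cheap = cost_per_correct_answer(1.0e-6, 500, 0.40)   # half the token price
strong = cost_per_correct_answer(2.0e-6, 500, 0.90)  # double the token price
print(f"cheap: {cheap:.6f}  strong: {strong:.6f}")
```

With these made-up numbers the model that costs twice as much per token is cheaper per correct answer (about 0.00111 versus 0.00125 dollars), which is the kind of reversal the first question points at.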

Topic Links

Inference Economics
Verification Economics
Verifier Transfer Audit

Linked Artifacts

Paper / May 16, 2026

Disaggregated or Colocated? The Cost-Frontier of LLM Serving Under SLO Contracts.

Research Paper #1 in the AI systems engineering wedge. A closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax. Re-derives published throughput figures from five 2023–2025 systems papers (PagedAttention, Sarathi-Serve, DistServe, Splitwise, Mooncake) in a common frame, plots the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts, and solves for the break-even surface between colocated and disaggregated architectures. The frontier partitions cleanly.
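The three-term split described above can be illustrated with a simple additive accounting. This is only a sketch and not the paper's closed form: the `ServingProfile` fields, the flat GPU price, and the treatment of SLO attainment as a single fraction are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ServingProfile:
    """Hypothetical inputs for one measurement window."""
    gpu_dollars_per_hour: float  # assumed flat price for every GPU
    prefill_gpu_hours: float     # GPU time spent on prefill
    decode_gpu_hours: float      # GPU time spent on decode
    kv_transfer_dollars: float   # network cost of moving KV cache (0 if colocated)
    tokens_served: int           # all tokens served in the window
    slo_attainment: float        # fraction of tokens from requests meeting both contracts


def cost_per_slo_token(p: ServingProfile) -> dict:
    """Split dollars per SLO-compliant token into three additive terms."""
    compliant_tokens = p.tokens_served * p.slo_attainment
    if compliant_tokens <= 0:
        raise ValueError("no SLO-compliant tokens in this window")
    prefill = p.gpu_dollars_per_hour * p.prefill_gpu_hours / compliant_tokens
    decode = p.gpu_dollars_per_hour * p.decode_gpu_hours / compliant_tokens
    kv_tax = p.kv_transfer_dollars / compliant_tokens
    return {"prefill": prefill, "decode": decode,
            "kv_transfer": kv_tax, "total": prefill + decode + kv_tax}


profile = ServingProfile(2.0, 1.0, 3.0, 0.5, 1_000_000, 0.8)
print(cost_per_slo_token(profile))
```

Dividing by compliant tokens rather than all tokens is the design choice that matters in this sketch: a configuration that misses its latency contracts pays for work that does not count.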

Paper / May 11, 2026

Calibration Drift Under Verifier Composition. A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.

Research Paper #2 in the verification-economics wedge. Per-verifier strictly proper elicitation does not compose. Pipeline miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. A joint scoring-rule mechanism on the cross-product report space restores DSIC and minimax-optimal regret of order sqrt((log K_1 + log K_2) / N). Per-component procurement records are insufficient evidence under the August 2026 EU AI Act high-risk obligations on composed pipelines.
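The simplest instance of a covariance identity of this kind is the AND rule on pass rates: for two accept indicators A and B, E[AB] = E[A]E[B] + Cov(A, B), so multiplying per-verifier pass rates misses the true joint pass rate by exactly the covariance. The sketch below checks that elementary fact numerically on made-up data; it is not the paper's theorem, and the function name is hypothetical.

```python
def and_composition_gap(accepts_a: list, accepts_b: list) -> tuple:
    """Return (gap, covariance) for two 0/1 verifier decisions per instance.

    gap        = true AND pass rate minus the product of marginal pass rates
    covariance = population covariance of the two accept indicators
    The two values coincide up to floating-point error.
    """
    if len(accepts_a) != len(accepts_b) or not accepts_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(accepts_a)
    mean_a = sum(accepts_a) / n
    mean_b = sum(accepts_b) / n
    joint = sum(a * b for a, b in zip(accepts_a, accepts_b)) / n
    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(accepts_a, accepts_b)) / n
    return joint - mean_a * mean_b, covariance


# Four hypothetical instances; verifier B accepts only when A also accepts.
gap, cov = and_composition_gap([1, 1, 0, 0], [1, 0, 0, 0])
print(gap, cov)
```

Here the marginal pass rates are 0.5 and 0.25, the independence estimate is 0.125, and the true joint pass rate is 0.25, so both the gap and the covariance are 0.125.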

Paper / May 15, 2026

The Inference-Time Compute Frontier. A Cost-Correct Threshold for Training Versus Test-Time Allocation.

Research Paper #2 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when the marginal compute dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel. Calibrated against rStar-Math, DeepSeek-R1, and test-time-compute curves; matches the observed frontier-vs-commodity market split.

Paper / May 15, 2026

The Routing Premium. An Economic Threshold for Difficulty-Conditional Inference Compute.

Research Paper #3 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when conditioning inference compute on a noisy difficulty estimate reduces cost-per-correct-answer: routing pays iff κ·Δ > γ, where κ is classifier calibration, Δ is workload heterogeneity, and γ is classifier overhead. Unifies five published patterns (speculative decoding, cascades, adaptive self-consistency, complexity-aware exploration, early exit) as one allocation rule, and calibrates against six deployed systems with every operating point on the positive side of the threshold.
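The stated rule, routing pays iff κ·Δ > γ, is direct to encode. The sketch below assumes all three quantities are already expressed in commensurable units (the blurb does not say how they are normalized), and the helper names are hypothetical.

```python
def routing_pays(kappa: float, delta: float, gamma: float) -> bool:
    """True when classifier calibration times workload heterogeneity
    exceeds classifier overhead (kappa * delta > gamma)."""
    return kappa * delta > gamma


def break_even_kappa(delta: float, gamma: float) -> float:
    """Calibration level at which routing exactly breaks even.

    Assumes delta > 0; with no heterogeneity, routing can never pay.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return gamma / delta


# Hypothetical operating points with heterogeneity 0.5 and overhead 0.3.
print(routing_pays(0.8, 0.5, 0.3))   # well-calibrated classifier
print(routing_pays(0.5, 0.5, 0.3))   # poorly calibrated classifier
print(break_even_kappa(0.5, 0.3))
```

In this toy setting the classifier needs calibration above 0.6 before routing recovers its own overhead, so the first operating point clears the threshold and the second does not.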

Paper / May 10, 2026

Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization.

Original research paper. Posted-price markets for verification-as-a-service collapse to the worst verifier under unobservable quality. A scoring-rule mechanism on adversarially constructed grounded probes is dominant-strategy incentive-compatible, with matching minimax regret bounds of order sqrt(log K / N).
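Two ingredients of this claim can be illustrated in a few lines: a strictly proper scoring rule rewards truthful reporting, and a bound of order sqrt(log K / N) shrinks slowly in the number of probes N. The sketch uses the Brier (quadratic) score as a stand-in, since the blurb does not say which rule the paper uses, and it drops all constants from the rate.

```python
import math


def expected_brier_loss(report: float, true_prob: float) -> float:
    """Expected quadratic loss of reporting `report` when the event
    actually occurs with probability `true_prob`."""
    return true_prob * (1 - report) ** 2 + (1 - true_prob) * report ** 2


def best_report(true_prob: float, grid_size: int = 101) -> float:
    """Report on an evenly spaced grid that minimizes expected Brier loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_brier_loss(q, true_prob))


def regret_rate(num_verifiers: int, num_probes: int) -> float:
    """sqrt(log K / N) with constants omitted (illustrative only)."""
    return math.sqrt(math.log(num_verifiers) / num_probes)


# A verifier whose true accuracy on a probe is 0.7 does best by saying 0.7.
print(best_report(0.7))
# Quadrupling the probe count halves the rate.
print(regret_rate(16, 400), regret_rate(16, 1600))
```

The expected loss simplifies to q² − 2pq + p, which has its unique minimum at q = p; that uniqueness is what "strictly proper" means here.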

Field Note / May 11, 2026

The Heterogeneous-GPU Margin. Coral and the Multi-LLM Procurement Problem.

A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi's Coral. Cost-efficient multi-LLM serving over heterogeneous cloud GPUs. Why the fragmentation of the LLM market and the heterogeneity of GPU supply make joint allocation the binding cost lever.

Field Note / May 16, 2026

Harvesting Serving Slack. ROSE and the Collapsed Train-Serve Boundary.

A daily field note on Gao, Zhao, Muhtar et al.'s ROSE. Cooperative elasticity for agentic RL rollouts on idle serving GPUs. Why the rollout-cost term in Cost-correct can be priced at the marginal-of-idle rate, and what that does to the inference-frontier threshold.

Field Note / May 17, 2026

The Power-Cap Illusion. SM Clock Locking and the Real Decode Lever.

A daily field note on Ma, Afzal, Eitzinger, and Wellein. Power capping does not bite in memory-bound LLM decode on NVIDIA H200. SM clock locking recovers up to 32% of decode energy. Why the standard energy lever moves the wrong knob, and what that does to the decode-cost term in Cost-correct.

Field Note / May 10, 2026

The Verifier as Curriculum. VHG and the Third Role.

The Structural Residual Ceiling. AI Pre-Decoders for the Surface Code.

The Alpha Asymmetry. Why Verifiers Can Be Smaller Than Generators.

The Exploit Tax. Why Verifier-Guided Reasoning Needs a Transfer Audit.

The Cost of Being Right. Verification Economics in 2026.

The Inference Stack in 2026.

AWQ Quantization

Speculative Decoding

Mamba And State-Space Models

Verification Economics
