Verification economics
Verification economics is the framework that treats cost-per-correct-answer as the operational unit of inference economics in 2026, replacing cost-per-token. The binding lever in this regime is the verifier: the small model, RL reward function, programmatic check, or self-consistency aggregator that decides which generated tokens are worth keeping.
Definition
The Cost-correct unit decomposes as
Cost-correct = (CPM × R × (1 + ρ̄)) / α(θ, V)
where
- CPM is the blended public-API cost per million tokens: the mean of the input price and the output price.
- R is the reasoning multiplier: the ratio of total billed output tokens (chain-of-thought plus final answer) to final-answer-only tokens for the same task. R = 1 for non-reasoning models. R can exceed 100 for reasoning models that perform extensive search.
- ρ̄ is the average rollout-or-rejection ratio under verifier-guided decoding (best-of-N, MCTS-at-decode, self-consistency). For a model that samples once, ρ̄ = 0. For a system that samples 16 candidates and verifies, ρ̄ ≈ 15.
- α(θ, V) is the verification accept rate at quality threshold θ on verifier V.
The Verified Capability per Dollar framework introduced in Field Notes #1 is the special case R → 1, ρ̄ → 0, α → 1. Cost-correct extends VCpD by making the reasoning, rollout, and verification terms first-class components of the unit: R and ρ̄ scale the numerator, and α is the denominator.
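The decomposition can be sketched as a small function. This is a minimal illustration, not a reference implementation: the function name, the argument checks, and every number in the example are hypothetical. The result carries the same units as CPM.

```python
def cost_correct(cpm: float, r: float, rho_bar: float, alpha: float) -> float:
    """Cost-correct = (CPM * R * (1 + rho_bar)) / alpha.

    cpm      blended cost per million tokens
    r        reasoning multiplier (>= 1)
    rho_bar  average rollout-or-rejection ratio (>= 0)
    alpha    verification accept rate, in (0, 1]
    """
    if r < 1.0:
        raise ValueError("r must be >= 1")
    if rho_bar < 0.0:
        raise ValueError("rho_bar must be >= 0")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return cpm * r * (1.0 + rho_bar) / alpha


# Hypothetical inputs, chosen only to exercise the formula.
# Special case R = 1, rho_bar = 0, alpha = 1: the unit collapses to CPM.
baseline = cost_correct(cpm=2.0, r=1.0, rho_bar=0.0, alpha=1.0)

# A reasoning model (R = 10) sampled 16 times (rho_bar = 15) with an
# 80% accept rate: 2.0 * 10 * 16 / 0.8.
best_of_16 = cost_correct(cpm=2.0, r=10.0, rho_bar=15.0, alpha=0.8)
```

With these made-up inputs the second configuration costs about 200 times the baseline per correct answer, which is the kind of gap a cost-per-token figure hides.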
Why this matters in 2026
Three observable shifts justify the new unit.
First, reasoning is billed as output tokens. Across every major lab’s public pricing schedule as of May 2026, internally generated chain-of-thought tokens are charged at the standard output rate. A reasoning model that emits a 50,000-token chain-of-thought before a 500-token final answer has a 100-to-1 reasoning-to-answer ratio (R = 101 under the definition above), billed entirely at the output rate.
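The arithmetic for the 50,000-token example can be checked in a few lines. The helper name and the output price of $10 per million tokens are assumptions made for illustration; no real price schedule is implied.

```python
def reasoning_multiplier(cot_tokens: int, answer_tokens: int) -> float:
    """R: total billed output tokens divided by final-answer-only tokens."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return (cot_tokens + answer_tokens) / answer_tokens


COT_TOKENS = 50_000
ANSWER_TOKENS = 500
OUTPUT_PRICE_PER_MILLION = 10.0  # hypothetical price in dollars

r = reasoning_multiplier(COT_TOKENS, ANSWER_TOKENS)

# Every output token, reasoning or answer, is billed at the output rate.
billed_cost = (COT_TOKENS + ANSWER_TOKENS) / 1_000_000 * OUTPUT_PRICE_PER_MILLION

# What the same call would cost if only the final answer were billed.
answer_only_cost = ANSWER_TOKENS / 1_000_000 * OUTPUT_PRICE_PER_MILLION
```

Here `r` is 101 rather than 100 because R counts the final answer in the numerator as well as the chain-of-thought.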
Second, the multiplier is large and variable. Recent benchmarks (OckBench, arXiv:2511.05722) measure up to a 5x token-efficiency dispersion between reasoning models that achieve similar accuracy on the same problem.
Third, the ARC-AGI-2 leaderboard shows a 70x-plus cost-per-task spread across published frontier configurations at near-equivalent accuracy. The dispersion is verification-conditional, not capability-conditional.
The lever
CPM compresses through stack-level engineering: quantization, kernels, runtime, hardware. α (the verification accept rate) rises through training-side and inference-side verifier engineering, which lowers Cost-correct by enlarging its denominator. The two are different disciplines.
Training-side verifiers concentrate capital into RL with verifiable rewards (RLVR, named in Tulu 3, Lambert et al. 2024) and process reward models (PRM800K, Lightman et al. 2023). DeepSeek-R1 (Nature 645:633-638) is the canonical demonstration of pure-RL reasoning with verifiable rewards.
Inference-side verifiers include best-of-N selection, self-consistency over sampled paths, Monte Carlo Tree Search at decode time (rStar-Math, Guan et al. 2025), and self-evaluation in Tree of Thoughts (Yao et al., NeurIPS 2023).
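Two of these inference-side patterns, best-of-N selection and self-consistency, can be sketched without any model in the loop. The candidate strings and the programmatic verifier below are hypothetical stand-ins; in a real system the candidates would be sampled from a generator and the verifier might be a process reward model or a test suite.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most frequent final answer across sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from four sampled paths for "what is 6 * 7?".
sampled = ["41", "42", "42", "40"]


def exact_check(answer: str) -> float:
    """Stand-in programmatic verifier: 1.0 if the answer equals 6 * 7, else 0.0."""
    return 1.0 if answer.strip() == str(6 * 7) else 0.0


picked_by_verifier = best_of_n(sampled, exact_check)
picked_by_vote = self_consistency(sampled)
```

Best-of-N needs an external scoring signal, while self-consistency uses agreement among samples as the signal. Both pay for every sampled path, which is the cost the ρ̄ term accounts for.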
The economic claim is that a small generator coupled to a small verifier can beat a large monolithic reasoning model on a per-task-cost basis. rStar-Math improves Phi3-mini-3.8B’s MATH accuracy from 41.4% to 86.4% by routing through a process preference model, surpassing o1-preview at small scale.
Production guidance
- Report Cost-correct, not CPM, when communicating production economics.
- Specify the verifier alongside the model: any “X% accuracy at $Y per task” claim is incomplete without naming the verifier under which X is measured.
- Track α as a first-class production metric. A regression in α is a more expensive failure than a CPM spike.
- Treat the verifier as a deployable artifact: versioned, evaluated, monitored for drift, often smaller and quantized, often deployable on-device.
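Tracking α as a production metric can be as simple as a rolling accept rate with an alert floor. The sketch below is one possible shape, assuming a fixed-size window of recent verifier decisions; the class name, window size, and floor value are hypothetical.

```python
from collections import deque


class AcceptRateMonitor:
    """Rolling estimate of the accept rate over the last `window` verifier decisions."""

    def __init__(self, window: int, floor: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self._outcomes = deque(maxlen=window)  # oldest decisions fall off automatically
        self._floor = floor

    def record(self, accepted: bool) -> None:
        """Store one verifier decision (True = accepted)."""
        self._outcomes.append(accepted)

    @property
    def alpha(self) -> float:
        """Fraction of accepted outputs in the current window."""
        if not self._outcomes:
            raise ValueError("no outcomes recorded yet")
        return sum(self._outcomes) / len(self._outcomes)

    def regressed(self) -> bool:
        """True once the window is full and alpha has dropped below the floor."""
        full = len(self._outcomes) == self._outcomes.maxlen
        return full and self.alpha < self._floor


# Hypothetical stream of verifier decisions.
monitor = AcceptRateMonitor(window=4, floor=0.75)
for accepted in [True, True, False, False]:
    monitor.record(accepted)
```

Waiting for a full window before alerting avoids paging on the first few requests after a deploy. A production version would also log the verifier version next to each decision so that a drop in α can be attributed to the generator or to the verifier.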
Related
- The Cost of Being Right: Verification Economics in 2026. Field note that introduces the framework with full bibliography and PDF.
- The Inference Stack in 2026. Defines Verified Capability per Dollar (VCpD); Cost-correct extends it.
- AWQ quantization. One of the four CPM levers.
- Speculative decoding. Decoding-time parallelism lever.
References
- Bhardwaj, M. The Cost of Being Right: Verification Economics in 2026. ifitsmanu.com, May 2026.
- Lambert, N. et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124, 2024.
- Lightman, H. et al. Let’s Verify Step by Step. arXiv:2305.20050, 2023.
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
- Du, Z. et al. OckBench: Measuring the Efficiency of LLM Reasoning. arXiv:2511.05722, 2025.