Verification economics

Verification economics is the framework that treats cost-per-correct-answer as the operational unit of inference economics in 2026, replacing cost-per-token. The binding lever in this regime is the verifier: the small model, RL reward function, programmatic check, or self-consistency aggregator that decides which generated tokens are worth keeping.

Definition

The Cost-correct unit decomposes as

Cost-correct = (CPM × R × (1 + ρ̄)) / α(θ, V)

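where CPM is the serving-cost term, R the reasoning term, ρ̄ the rollout term, and α(θ, V) the verification accept rate.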

The Verified Capability per Dollar (VCpD) framework introduced in Field Notes #1 is the special case R → 1, ρ̄ → 0, α → 1. Cost-correct extends VCpD by making the reasoning, rollout, and verification terms first-class factors of the unit: reasoning and rollout multiply the numerator, and the accept rate divides it.
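As a minimal sketch, the decomposition can be written as a single function. The function name, the argument names, and the example values below are illustrative assumptions, not part of the framework. The sketch also assumes α is a rate in (0, 1] and that CPM is given in whatever currency unit the caller wants the result in.

```python
def cost_correct(cpm: float, r: float, rho_bar: float, alpha: float) -> float:
    """Cost-correct = (CPM * R * (1 + rho_bar)) / alpha.

    cpm     -- serving-cost term (unit chosen by the caller; assumption)
    r       -- reasoning term
    rho_bar -- rollout term
    alpha   -- verification accept rate, assumed to lie in (0, 1]
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return cpm * r * (1.0 + rho_bar) / alpha


# VCpD special case: R = 1, rho_bar = 0, alpha = 1 leaves only CPM.
vcpd_case = cost_correct(cpm=2.0, r=1.0, rho_bar=0.0, alpha=1.0)

# Hypothetical reasoning configuration (all values are made up for illustration).
reasoning_case = cost_correct(cpm=2.0, r=10.0, rho_bar=3.0, alpha=0.8)

print(f"VCpD special case: {vcpd_case:.2f}")
print(f"Reasoning configuration: {reasoning_case:.2f}")
```

With the same CPM, the second configuration costs 50 times more per correct answer, and none of that difference comes from the per-token price.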

Why this matters in 2026

Three observable shifts justify the new unit.

First, reasoning is billed as output tokens. Across every major lab’s public pricing schedule as of May 2026, internally generated chain-of-thought tokens are charged at the standard output rate. A reasoning model that emits a 50,000-token chain of thought before a 500-token final answer has a 100-to-1 reasoning-to-answer ratio, and all of those tokens are billed at the output rate.
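The arithmetic behind that example can be sketched as follows. The token counts come from the paragraph above. The output price is a hypothetical placeholder, not any lab's actual rate, and the function name is illustrative.

```python
def billed_output_cost(reasoning_tokens: int, answer_tokens: int,
                       price_per_million: float) -> float:
    """Cost of one response when reasoning and answer tokens share the output rate."""
    billed_tokens = reasoning_tokens + answer_tokens
    return billed_tokens / 1_000_000 * price_per_million


REASONING_TOKENS = 50_000
ANSWER_TOKENS = 500
HYPOTHETICAL_PRICE = 10.0  # assumed price per million output tokens, illustration only

ratio = REASONING_TOKENS / ANSWER_TOKENS
reasoning_share = REASONING_TOKENS / (REASONING_TOKENS + ANSWER_TOKENS)
cost = billed_output_cost(REASONING_TOKENS, ANSWER_TOKENS, HYPOTHETICAL_PRICE)

print(f"reasoning-to-answer ratio: {ratio:.0f} to 1")
print(f"share of the bill spent on reasoning: {reasoning_share:.1%}")
print(f"cost per response at the assumed price: {cost:.4f}")
```

Whatever price is plugged in, about 99% of the bill for this response pays for tokens the user never reads.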

Second, the reasoning multiplier is large and variable. A recent benchmark (OckBench, arXiv:2511.05722) measures up to a 5x token-efficiency dispersion between reasoning models that achieve similar accuracy on the same problem.

Third, the ARC-AGI-2 leaderboard shows a 70x-plus cost-per-task spread across published frontier configurations at near-equivalent accuracy. The dispersion is verification-conditional, not capability-conditional.

The lever

CPM compresses through stack-level engineering: quantization, kernels, runtime, hardware. α (the verification accept rate) compresses through training-side and inference-side verifier engineering. The two are different disciplines.

Training-side verifiers concentrate capital into RL with verifiable rewards (RLVR, named in Tulu 3, Lambert et al. 2024) and process reward models (PRM800K, Lightman et al. 2023). DeepSeek-R1 (Nature 645:633-638) is the canonical demonstration of pure-RL reasoning with verifiable rewards.

Inference-side verifiers include best-of-N selection, self-consistency over sampled paths, Monte Carlo Tree Search at decode time (rStar-Math, Guan et al. 2025), and self-evaluation in Tree of Thoughts (Yao et al., NeurIPS 2023).
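Two of these inference-side verifiers are simple enough to sketch in a few lines. The code below is a toy illustration, not the method of any cited paper: the candidate strings stand in for sampled model outputs, and `toy_verifier` is a hypothetical programmatic check for a single arithmetic question.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most frequent final answer across sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_verifier(candidate: str) -> float:
    """Hypothetical programmatic check for the question 'what is 6 * 7?'."""
    try:
        return 1.0 if int(candidate) == 6 * 7 else 0.0
    except ValueError:
        return 0.0  # non-numeric output fails the check


# Stand-ins for the final answers of five sampled paths.
sampled = ["42", "41", "42", "forty-two", "42"]

voted = self_consistency(sampled)
selected = best_of_n(sampled, toy_verifier)
print(voted, selected)
```

Self-consistency needs no verifier beyond the vote itself, while best-of-N is only as good as the scoring function it is given. In both cases, every extra sample adds to the rollout term.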

The economic claim is that a small generator coupled to a small verifier can beat a large monolithic reasoning model on a per-task-cost basis. rStar-Math improves Phi3-mini-3.8B’s MATH accuracy from 41.4% to 86.4% by routing through a process preference model, surpassing o1-preview at small scale.
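Under the Cost-correct decomposition, this claim can be stated as a break-even condition. The subscripts s (small generator plus verifier) and ℓ (large monolithic model) are labels introduced here for illustration. The sketch assumes the verifier's own compute is already counted in the small configuration's CPM or rollout term.

```latex
\[
\frac{\mathrm{CPM}_s \, R_s \, (1 + \bar{\rho}_s)}{\alpha_s}
\;<\;
\frac{\mathrm{CPM}_\ell \, R_\ell \, (1 + \bar{\rho}_\ell)}{\alpha_\ell}
\quad\Longleftrightarrow\quad
\frac{\alpha_s}{\alpha_\ell}
\;>\;
\frac{\mathrm{CPM}_s \, R_s \, (1 + \bar{\rho}_s)}{\mathrm{CPM}_\ell \, R_\ell \, (1 + \bar{\rho}_\ell)}
\]
```

Read this way, the small configuration can afford a lower accept rate than the large model, as long as the shortfall is smaller than its advantage in per-attempt cost.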

Production guidance

References

  1. Bhardwaj, M. The Cost of Being Right: Verification Economics in 2026. ifitsmanu.com, May 2026.
  2. Lambert, N. et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124, 2024.
  3. Lightman, H. et al. Let’s Verify Step by Step. arXiv:2305.20050, 2023.
  4. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
  5. Du, Z. et al. OckBench: Measuring the Efficiency of LLM Reasoning. arXiv:2511.05722, 2025.
