Verification economics

Verification economics is the framework that treats cost-per-correct-answer as the operational unit of inference economics in 2026, replacing cost-per-token. The binding lever in this regime is the verifier: the small model, RL reward function, programmatic check, or self-consistency aggregator that decides which generated tokens are worth keeping.

Definition

The Cost-correct unit decomposes as

Cost-correct = (CPM × R × (1 + ρ̄)) / α(θ, V)

where

CPM is the blended public-API cost per million tokens: the simple average of the input price and the output price.
R is the reasoning multiplier: the ratio of total billed output tokens (chain-of-thought plus final answer) to final-answer-only tokens for the same task. R = 1 for non-reasoning models. R can exceed 100 for reasoning models that perform extensive search.
ρ̄ is the average rollout-or-rejection ratio under verifier-guided decoding (best-of-N, MCTS-at-decode, self-consistency). For a model that samples once, ρ̄ = 0. For a system that samples 16 candidates and verifies, ρ̄ ≈ 15.
α(θ, V) is the verification accept rate at quality threshold θ on verifier V.
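The decomposition can be evaluated directly. The following is a minimal sketch, not a reference implementation: the function name `cost_correct` and all numeric inputs are hypothetical and chosen only to exercise the formula. Because CPM is priced per million tokens, the result is in the same unit (dollars per million final-answer tokens that pass the verifier).

```python
def cost_correct(cpm: float, r: float, rho_bar: float, alpha: float) -> float:
    """Evaluate Cost-correct = (CPM * R * (1 + rho_bar)) / alpha.

    cpm      blended price in dollars per million tokens
    r        reasoning multiplier (1 for non-reasoning models)
    rho_bar  average rollout-or-rejection ratio (0 for a single sample)
    alpha    verification accept rate, in (0, 1]
    """
    if cpm < 0:
        raise ValueError("cpm must be non-negative")
    if r < 1:
        raise ValueError("r must be at least 1")
    if rho_bar < 0:
        raise ValueError("rho_bar must be non-negative")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return cpm * r * (1 + rho_bar) / alpha


# Hypothetical inputs, for illustration only.
single_sample = cost_correct(cpm=10.0, r=1.0, rho_bar=0.0, alpha=1.0)
best_of_16 = cost_correct(cpm=10.0, r=20.0, rho_bar=15.0, alpha=0.5)

print(f"single sample: {single_sample:.2f}")
print(f"best-of-16:    {best_of_16:.2f}")
```

With R = 1, ρ̄ = 0, and α = 1 the function returns CPM unchanged; every other setting multiplies it.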

The Verified Capability per Dollar framework introduced in Field Notes #1 is the special case R → 1, ρ̄ → 0, α → 1. Cost-correct extends VCpD by making the reasoning, rollout, and verification terms first-class components of the unit: R and (1 + ρ̄) multiply the cost, and α divides it.

Why this matters in 2026

Three observable shifts justify the new unit.

First, reasoning is billed as output tokens. Across every major lab’s public pricing schedule as of May 2026, internally generated chain-of-thought tokens are charged at the standard output rate. A reasoning model that emits a 50,000-token chain-of-thought before a 500-token final answer has a 100-to-1 reasoning-to-answer ratio (R = 101 under the definition above), billed entirely at the output rate.
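The arithmetic behind that example can be checked in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the output price of $10 per million tokens is an illustrative placeholder, not a quoted rate from any lab.

```python
def reasoning_multiplier(cot_tokens: int, answer_tokens: int) -> float:
    """R: total billed output tokens divided by final-answer tokens."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return (cot_tokens + answer_tokens) / answer_tokens


def billed_output_cost(cot_tokens: int, answer_tokens: int,
                       output_price_per_million: float) -> float:
    """Dollars billed when chain-of-thought is charged at the output rate."""
    return (cot_tokens + answer_tokens) * output_price_per_million / 1_000_000


# The 50,000-token chain-of-thought and 500-token answer from the text.
r = reasoning_multiplier(cot_tokens=50_000, answer_tokens=500)

# Assumed price: $10 per million output tokens (placeholder value).
with_reasoning = billed_output_cost(50_000, 500, output_price_per_million=10.0)
answer_only = billed_output_cost(0, 500, output_price_per_million=10.0)

print(f"R = {r}")
print(f"billed with reasoning: ${with_reasoning:.4f}")
print(f"billed for answer only: ${answer_only:.4f}")
```

The ratio of the two billed amounts equals R, which is why the multiplier enters Cost-correct as a plain factor.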

Second, the multiplier is large and variable. Recent benchmarks (OckBench, arXiv:2511.05722) measure up to a 5x token-efficiency dispersion between reasoning models that achieve similar accuracy on the same problem.

Third, the ARC-AGI-2 leaderboard shows a 70x-plus cost-per-task spread across published frontier configurations at near-equivalent accuracy. The dispersion is verification-conditional, not capability-conditional.

The lever

CPM compresses through stack-level engineering: quantization, kernels, runtime, hardware. α (the verification accept rate) rises through training-side and inference-side verifier engineering, and because α sits in the denominator, raising it lowers Cost-correct. The two are different disciplines.

Training-side verifiers concentrate capital into RL with verifiable rewards (RLVR, named in Tulu 3, Lambert et al. 2024) and process reward models (PRM800K, Lightman et al. 2023). DeepSeek-R1 (Nature 645:633-638) is the canonical demonstration of pure-RL reasoning with verifiable rewards.

Inference-side verifiers include best-of-N selection, self-consistency over sampled paths, Monte Carlo Tree Search at decode time (rStar-Math, Guan et al. 2025), and self-evaluation in Tree of Thoughts (Yao et al., NeurIPS 2023).
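Two of these inference-side verifiers are simple enough to sketch. The code below is illustrative only: the function names, the toy candidate strings, and the programmatic check are all assumptions made for the example, and a production verifier would typically be a learned reward model or a task-specific checker rather than a string comparison.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              verifier: Callable[[str], float]) -> str:
    """Best-of-N selection: keep the candidate the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def self_consistency(answers: Sequence[str]) -> str:
    """Self-consistency: majority vote over final answers of sampled paths."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Toy programmatic check for the task "17 * 23" (hypothetical example task).
def exact_product_check(candidate: str) -> float:
    return 1.0 if candidate.strip() == str(17 * 23) else 0.0


sampled = ["389", "391", "401", "391"]  # stand-ins for sampled model outputs

print(best_of_n(sampled, exact_product_check))
print(self_consistency(sampled))
```

Best-of-N needs an external scorer, while self-consistency uses agreement among samples as the verifier; both pay for the extra samples through the (1 + ρ̄) term.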

The economic claim is that a small generator coupled to a small verifier can beat a large monolithic reasoning model on a per-task-cost basis. rStar-Math improves Phi3-mini-3.8B’s MATH accuracy from 41.4% to 86.4% by routing through a process preference model, surpassing o1-preview at small scale.
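One way to make the claim concrete is to ask how many samples a small system can afford before it stops being cheaper. The sketch below rearranges the Cost-correct formula for that break-even point. The function name and every input are hypothetical placeholders, not measurements of rStar-Math, Phi3-mini, o1-preview, or any other system.

```python
def break_even_samples(cpm_small: float, r_small: float, alpha_small: float,
                       cpm_large: float, r_large: float, alpha_large: float,
                       rho_bar_large: float = 0.0) -> float:
    """Largest (1 + rho_bar) at which the small system still ties on Cost-correct.

    Solves  cpm_small * r_small * n / alpha_small
          = cpm_large * r_large * (1 + rho_bar_large) / alpha_large   for n.
    """
    if min(cpm_small, r_small, alpha_small, alpha_large) <= 0:
        raise ValueError("prices, multipliers, and accept rates must be positive")
    large_cost = cpm_large * r_large * (1 + rho_bar_large) / alpha_large
    return large_cost * alpha_small / (cpm_small * r_small)


# Hypothetical numbers: a cheap small generator with a verifier versus
# an expensive monolithic reasoning model sampled once.
n_max = break_even_samples(
    cpm_small=0.5, r_small=4.0, alpha_small=0.5,
    cpm_large=30.0, r_large=50.0, alpha_large=0.5,
)
print(f"small system wins while (1 + rho_bar) < {n_max:.0f}")
```

Under these placeholder inputs the small system can draw several hundred verified samples per task before matching the large model's cost, which is the shape of the argument, not a measured result.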

Production guidance

Report Cost-correct, not CPM, when communicating production economics.
Specify the verifier alongside the model: any “X% accuracy at $Y per task” claim is incomplete without naming the verifier under which X is measured.
Track α as a first-class production metric. A regression in α can be a more expensive failure than a CPM spike, because Cost-correct scales with 1/α and a drop in α is not visible in per-token billing.
Treat the verifier as a deployable artifact: versioned, evaluated, monitored for drift, often smaller and quantized, often deployable on-device.
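Tracking α in production can be as simple as a rolling accept rate tagged with the verifier version. The class below is a minimal sketch under assumptions: the class name, the window size, the alert threshold, and the version string are all hypothetical, and a real deployment would export this to whatever metrics system is already in use.

```python
from collections import deque
from typing import Deque, Optional


class AcceptRateMonitor:
    """Rolling estimate of alpha for one verifier version."""

    def __init__(self, verifier_version: str, window: int = 1000,
                 min_alpha: float = 0.7) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.verifier_version = verifier_version
        self.min_alpha = min_alpha
        # Oldest outcomes fall off automatically once the window is full.
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, accepted: bool) -> None:
        """Record whether the verifier accepted one generated answer."""
        self._outcomes.append(bool(accepted))

    @property
    def alpha(self) -> Optional[float]:
        """Accept rate over the current window, or None if empty."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def regressed(self) -> bool:
        """True when the windowed accept rate is below the threshold."""
        current = self.alpha
        return current is not None and current < self.min_alpha


# Hypothetical usage with a tiny window so the behavior is easy to follow.
monitor = AcceptRateMonitor("verifier-v1", window=4, min_alpha=0.5)
for outcome in (True, True, False, False):
    monitor.record(outcome)
print(monitor.alpha, monitor.regressed())
```

Keying the monitor to a verifier version keeps accept rates from different verifier releases from being averaged together.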

Related

The Cost of Being Right: Verification Economics in 2026. Field note that introduces the framework with full bibliography and PDF.
The Inference Stack in 2026. Defines Verified Capability per Dollar (VCpD); Cost-correct extends it.
AWQ quantization. One of the four CPM levers.
Speculative decoding. Decoding-time parallelism lever.

References

Bhardwaj, M. The Cost of Being Right: Verification Economics in 2026. ifitsmanu.com, May 2026.
Lambert, N. et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124, 2024.
Lightman, H. et al. Let’s Verify Step by Step. arXiv:2305.20050, 2023.
DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
Du, Z. et al. OckBench: Measuring the Efficiency of LLM Reasoning. arXiv:2511.05722, 2025.
