The Inference-Time Compute Frontier.
A Cost-Correct Threshold for Training Versus Test-Time Allocation.
Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the inference-economics wedge.
Companion to the verification-economics field notes. The Cost of Being Right (Field Notes #2) develops the Cost-correct decomposition. The α Asymmetry (Field Notes #3) shows that verifier accept rate dominates the other cost levers. Verifier Procurement Under Unobservable Quality (Research Paper #1) closes the gap when the deployer must buy rather than build. This paper answers a different question: given that you are building, when should the next dollar go to more rollouts rather than more training?
Abstract
When does an additional dollar of compute reduce cost-per-correct-answer faster when spent on inference-time scaling than when spent on further training? Snell et al. (2024) and Brown et al. (2024) show that test-time compute can substitute for training compute on hard reasoning tasks, and Guan et al. (2025) show that verifier-guided rollouts let small models match flagship reasoners. None of them gives an economic threshold that says where the substitution holds. We derive one. Under the Cost-correct decomposition of The Cost of Being Right, with verifier accept rate $\alpha$ parameterized jointly in training compute $T$ and rollout count $\rho$, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff the rollout-net-of-cost elasticity ratio $(\eta_\alpha^\rho - 1)/\eta_\alpha^T$ exceeds the inference-to-training dollar ratio at the operating point. We calibrate the threshold against rStar-Math, DeepSeek-R1, and the published test-time-compute curves of Snell et al. (2024) and Brown et al. (2024), and show that the calibration matches the observed market split between frontier reasoning tiers and commodity tiers.
1. Introduction
Frontier reasoning models in 2025 ship with explicit thinking budgets. rStar-Math couples a 7B generator with a 7B process-reward verifier and Monte-Carlo Tree Search rollouts to beat o1-preview on AIME 2024 and MATH at a fraction of the inference dollar (Guan et al., 2025). DeepSeek-R1 lifts pass@1 on the same benchmarks through reinforcement learning with verifiable-reward signals at fixed rollout count (DeepSeek-AI, 2025). OpenAI’s o-series and the GPT-5.5 launch in April 2026 advertise per-query reasoning budgets as a first-class API parameter. Commodity tiers do not. GPT-5.4 nano, Gemini Flash, and Claude Haiku 4.5 ship without rollout budgets and serve a workload mix dominated by retrieval and short-form generation.
Two features of this split are striking. First, the split is sharp. There is no continuous gradient of “small thinking budget” tiers in the market; either a model deploys explicit inference-time scaling or it does not. Second, the split is recent. As late as 2024, even frontier providers shipped without dedicated rollout budgets, and the available economic frame was the Chinchilla compute-optimal training-data ratio of Hoffmann et al. (2022). The frame has since shifted to a question that the Chinchilla setup does not answer: where on the joint training-and-inference frontier should the next compute dollar go?
Snell et al. (2024) and Brown et al. (2024) answer the related question of substitutability but not the question of allocation. Snell et al. show on PaLM-2 that test-time compute can replace 14 times more pre-training compute on hard reasoning subsets. Brown et al. show on Llama-class models that coverage (pass@k) under repeated sampling grows roughly log-linearly in the number of samples, a relationship they model with an exponentiated power law, and that the curve crosses the parameter-scaling curve at a benchmark-dependent crossover. Both papers fix the verifier and report accuracy-versus-compute curves. Neither casts the result as a cost-allocation problem with explicit verifier construction cost, and neither isolates the conditions under which substitution holds.
This paper supplies the missing economic threshold. The contribution is a closed-form condition under which the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, expressed in three observable parameters: the elasticity of verifier accept rate with respect to rollout count, the elasticity of accept rate with respect to training compute, and the inference-to-training dollar ratio at the operating point. The threshold derives from the Cost-correct decomposition and the verifier-dominance result of The α Asymmetry. It requires modeling the verifier accept rate as a joint function of both compute channels, taking partial derivatives in both, and identifying a closed-form switching condition.
We calibrate the threshold against four operating points. The threshold is crossed at the hard-difficulty subsets reported by Snell et al. and Brown et al.; it is not crossed at the easy subsets in the same papers, nor at the workload mixes implied by commodity-tier deployments. rStar-Math holds $T$ fixed at 7B and runs $\rho$ past the cost-correct optimum to chase headline accuracy on AIME 2024. DeepSeek-R1 sits at $\rho = 1$, where the threshold predicts the inference channel cannot clear the bar given the very high $\eta_\alpha^T$ that the verifiable-reward RL stage realizes on the V3 base. The pattern matches the observed market split.
2. Related work
Inference-time scaling. Snell et al. (2024) study optimal allocation of test-time compute across rollouts, revisions, and search depth on PaLM-2. Brown et al. (2024) study repeated sampling on Llama and Pythia across HumanEval, MATH, GSM8K, and MiniF2F. Both papers hold the verifier fixed and treat it as an oracle. Neither incorporates verifier construction cost or partitions a budget across the training and inference channels.
Cost-of-pass and cost-correct. Erol et al. (2026) introduce Cost-of-Pass as a per-accepted-correct-answer metric. The Cost of Being Right develops the multiplicative Cost-correct decomposition that separates cost-per-million-tokens, the reasoning multiplier, the rollout ratio, and the verifier accept rate. The α Asymmetry shows that the partial derivative of Cost-correct with respect to $\alpha$ dominates the other partials in production regimes. None study allocation across training and inference channels.
Compute-optimal training. Kaplan et al. (2020) and Hoffmann et al. (2022) establish single-channel scaling laws. The Chinchilla frontier optimizes training compute at a single inference operating point. It does not extend to a regime in which the next dollar can be allocated to inference-time rollouts that lift verifier accept rate.
A separate body of work supplies the empirical content of the elasticity calibrations in Section 4. Cobbe et al. (2021) introduced outcome-reward verifiers on GSM8K; Lightman et al. (2023) drew the explicit ORM-versus-PRM distinction and showed that step-level process-reward signals dominate on MATH; Guan et al. (2025) developed verifier-guided decoding.
3. Method
3.1. Cost-correct, restated
We work in the Cost-correct framework. The unit cost of a correct answer is

$$C_{\text{correct}} = \frac{p \cdot m \cdot \bar\rho}{\alpha} \tag{1}$$

where $p$ is the blended cost per million tokens at a unit input-to-output ratio, $m$ is the reasoning multiplier (output tokens per accepted answer), $\bar\rho$ is the average rollout ratio, and $\alpha$ is the verifier accept rate. The α-asymmetry result establishes that the partial derivative of $C_{\text{correct}}$ with respect to $\alpha$ dominates the other partials in magnitude, with equality approached in the high-rollout limit. This asymmetry makes verifier accept rate the natural pivot for a two-channel allocation rule.
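As a minimal sketch of the decomposition, assuming the multiplicative form described above (token price times reasoning multiplier times rollout ratio, divided by accept rate), the unit cost can be computed as follows. The function name, argument names, and input values are illustrative and are not taken from the companion notes.

```python
def cost_correct(price_per_mtok, reasoning_multiplier, rollout_ratio, accept_rate):
    """Cost per accepted-correct answer under an assumed multiplicative form."""
    if not 0.0 < accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in (0, 1]")
    return price_per_mtok * reasoning_multiplier * rollout_ratio / accept_rate


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
baseline = cost_correct(2.0, 3.0, 4.0, 0.5)

# Doubling the accept rate halves the unit cost when the other levers are held fixed.
higher_accept = cost_correct(2.0, 3.0, 4.0, 1.0)
```

Because the accept rate sits in the denominator, any channel that lifts it lowers the unit cost, which is why the sections below track its elasticities.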
3.2. Two-channel parameterization
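Let $T$ denote post-training compute spent on the generator (in FLOP-units) and $\rho$ denote rollout count per query. We parameterize the verifier accept rate as

$$\alpha(T, \rho) = \alpha_0(T)\,\lambda(T, \rho) \tag{3}$$

where $\alpha_0(T)$ is the base accept rate of an unfiltered single rollout and $\lambda$ is the verifier lift from selecting the best of $\rho$ rollouts under a fixed verifier. We adopt the separability assumption

$$\lambda(T, \rho) = \lambda(\rho) \tag{4}$$

and define the elasticities

$$\eta_\alpha^T = \frac{\partial \ln \alpha}{\partial \ln T}, \qquad \eta_\alpha^\rho = \frac{\partial \ln \alpha}{\partial \ln \rho}. \tag{5}$$

Under (4) the cross-partial $\partial^2 \ln \alpha / \partial \ln T \, \partial \ln \rho$ vanishes. Separability is justified empirically when verifier-guided selection acts on a fixed generator distribution that has already absorbed the post-training lift, as in best-of-N reranking with a frozen process-reward model.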
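The elasticities and the separability property can be illustrated numerically. The sketch below assumes a purely hypothetical separable accept rate, a power law in training compute times a power law in rollout count; the functional form and every constant are assumptions for illustration, not estimates from any cited source. It estimates both log-elasticities by central differences and checks that the rollout elasticity does not move when training compute changes.

```python
import math


def alpha(T, rho, a=0.05, b=0.2, g=0.1):
    """Hypothetical separable accept rate: base rate a*T**b times lift rho**g."""
    return a * T**b * rho**g


def log_elasticity(f, x, h=1e-4):
    """Central-difference estimate of d ln f / d ln x at the point x."""
    up = math.log(f(x * math.exp(h)))
    down = math.log(f(x * math.exp(-h)))
    return (up - down) / (2 * h)


T0, rho0 = 100.0, 8.0

# Elasticity of the accept rate with respect to each channel at (T0, rho0).
eta_T = log_elasticity(lambda T: alpha(T, rho0), T0)
eta_rho = log_elasticity(lambda r: alpha(T0, r), rho0)

# Separability check: the rollout elasticity is unchanged at twice the training compute.
eta_rho_at_2T = log_elasticity(lambda r: alpha(2 * T0, r), rho0)
```

Under this assumed form the two elasticities are the exponents `b` and `g`. A non-separable lift would show up as a gap between `eta_rho` and `eta_rho_at_2T`.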
3.3. Cost ratio and budget constraint
Let $c_T$ denote the marginal cost of one unit of post-training FLOP, amortized over the expected query lifetime $Q$, and $c_\rho$ the marginal cost of one unit of inference FLOP per query. Define

$$r = \frac{c_\rho}{c_T}. \tag{6}$$
Under public price points and the price-of-progress dataset of Liao et al. (2025), at the frontier operating point in 2026 is on the order of to per query when amortized over a generator’s commercial lifetime. The operationally relevant quantity is the dollar ratio at the operating point.
3.4. The threshold theorem
The inference channel clears the threshold when the rollout-net-of-cost elasticity ratio exceeds the inference-to-training dollar ratio at the operating point.
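We state the result in the rollout-dominant regime where $\bar\rho \approx \rho$, so that $\partial \ln \bar\rho / \partial \ln \rho = 1$. The general statement appears in Appendix A.

Theorem (Threshold). At an interior operating point $(T, \rho)$ with $\eta_\alpha^T > 0$, under separability (4) and the cost ratio (6), the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel iff

$$\frac{\eta_\alpha^\rho - 1}{\eta_\alpha^T} > \frac{c_\rho\,\rho}{c_T\,T}. \tag{7}$$

Proof. Take logs of (1). The fractional reduction in $C_{\text{correct}}$ from a 1% increase in $T$ is $\eta_\alpha^T$ percent, at a dollar cost of $0.01\,c_T T$. The fractional reduction in $C_{\text{correct}}$ from a 1% increase in $\rho$ is $(\eta_\alpha^\rho - 1)$ percent, at a dollar cost of $0.01\,c_\rho \rho$. Per-dollar log-reductions:

$$\Delta_T = \frac{\eta_\alpha^T}{c_T\,T}, \qquad \Delta_\rho = \frac{\eta_\alpha^\rho - 1}{c_\rho\,\rho}. \tag{8}$$

The inference channel dominates iff $\Delta_\rho > \Delta_T$. Cross-multiplying gives (7).

The theorem partitions the $(T, \rho)$ plane into a training-dominated region and an inference-dominated region. The optimum lies on the boundary, where (7) holds with equality. The right-hand side is observable from the deployment cost ledger. The left-hand side is the rollout-net-of-cost elasticity ratio: it credits rollouts only for the lift in $\alpha$ above the per-rollout cost elasticity $\partial \ln \bar\rho / \partial \ln \rho$, which in the rollout-dominant regime is unity.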
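The switching condition can be sketched as a comparison of per-dollar log-reductions. The code assumes the rollout-dominant form in which the inference channel is credited with the rollout elasticity minus one, as the summary table's rStar-Math row suggests. The function name, the training elasticity of 0.5, and the unit dollar amounts are hypothetical.

```python
def inference_channel_dominates(eta_rho, eta_T, inference_dollars, training_dollars):
    """Return True when the inference channel gives the larger per-dollar log-reduction.

    eta_rho: elasticity of accept rate with respect to rollout count.
    eta_T: elasticity of accept rate with respect to training compute.
    inference_dollars, training_dollars: spend on each channel at the operating point.
    """
    if eta_T <= 0 or inference_dollars <= 0 or training_dollars <= 0:
        raise ValueError("eta_T and both dollar amounts must be positive")
    # Rollouts raise the accept rate but also raise the rollout ratio one-for-one.
    gain_inference = (eta_rho - 1.0) / inference_dollars
    gain_training = eta_T / training_dollars
    return gain_inference > gain_training


# Rollout elasticity of 0.031 (the table's rStar-Math value); other inputs are made up.
low_elasticity_case = inference_channel_dominates(0.031, 0.5, 1.0, 1.0)

# A hypothetical operating point where rollouts lift the accept rate faster than they cost.
high_elasticity_case = inference_channel_dominates(1.6, 0.5, 1.0, 1.0)
```

Comparing the two per-dollar gains directly is equivalent to the cross-multiplied inequality whenever the training elasticity and both dollar amounts are positive, which the guard clause enforces.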
3.5. Comparative statics
Three corollaries follow directly from (7).
Corollary 1 (frontier ceiling). As at fixed verifier, . The right-hand side of (7) is bounded; the left-hand side grows without bound. Frontier-difficulty subsets satisfy the threshold; easy subsets do not.
Corollary 2 (reasoning multiplier). Tasks with a high reasoning multiplier $m$ magnify the absolute dollar return to either channel. Combined with the α-asymmetry result, reasoning-heavy workloads favor inference-time allocation; retrieval-heavy workloads do not.
Corollary 3 (amortization). When the query lifetime $Q$ is large, the amortized training cost $c_T$ falls and the cost ratio $r$ rises, so the inference channel must clear a higher bar to dominate. This predicts that high-throughput commodity tiers serving long-lived workloads do not deploy thinking budgets, because the cost-per-correct-answer reduction from rollouts on easy tasks is too small to clear the raised bar.
4. Experiments
This section calibrates the threshold (7) against four operating points. All numbers are cited from primary sources; we report no new measurements.
4.1. rStar-Math (Microsoft Research, January 2025)
Guan et al. (2025) report a Qwen2.5-Math-7B generator paired with a 7B process-reward verifier and MCTS rollouts. The deployed configuration runs $\rho = 64$ rollouts per query, reporting pass@1 of 0.533 on AIME 2024 and 0.900 on MATH-500 (Table B.1).
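The secant elasticity over the in-MCTS sweep from $\rho = 8$ to $\rho = 64$ (Table B.1) on AIME 2024 is $\eta_\alpha^\rho \approx 0.031$; on MATH-500 it is $\eta_\alpha^\rho \approx 0.003$.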
Substituting into (7): $(\eta_\alpha^\rho - 1)/\eta_\alpha^T < 0$ for any positive $\eta_\alpha^T$, so the left-hand side cannot exceed a positive dollar ratio. The inference channel does not clear the threshold at the deployed $\rho = 64$. rStar-Math optimized headline accuracy at fixed model scale, not cost-per-correct-answer; the deployed configuration sits inside the verifier-ceiling regime (Corollary 1). A cost-conscious redeployment would run at materially lower $\rho$, trading accuracy for cost-per-correct-answer reduction.
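The secant elasticities can be reproduced from the pass@1 values in Table B.1. The sketch below assumes that pass@1 stands in for the accept rate and that the sweep endpoints are the table's rollout counts of 8 and 64; the function name is illustrative.

```python
import math


def secant_elasticity(alpha_lo, alpha_hi, rho_lo, rho_hi):
    """Secant estimate of d ln(alpha) / d ln(rho) between two operating points."""
    return math.log(alpha_hi / alpha_lo) / math.log(rho_hi / rho_lo)


# pass@1 at 8 and 64 rollouts, from Table B.1.
aime_2024 = secant_elasticity(0.500, 0.533, 8, 64)
math_500 = secant_elasticity(0.894, 0.900, 8, 64)

print(round(aime_2024, 3))  # 0.031
print(round(math_500, 3))   # 0.003
```

Both values are far below one, so the rollout elasticity net of the per-rollout cost is negative on either benchmark.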
4.2. DeepSeek-R1 (DeepSeek-AI, January 2025)
DeepSeek-AI (2025) lift pass@1 on AIME 2024 from 0.392 (DeepSeek-V3 base) to 0.798 (DeepSeek-R1) through RL with verifiable-reward signals at fixed rollout count ($\rho = 1$); see Table B.2.
DeepSeek does not disclose RL post-training compute as a fraction of V3 pre-training. Under a sensitivity bracket , the implied training-channel elasticity on AIME 2024 is .
For the inference channel to clear (7) at R1 would require , meaning to , implausible for any published verifier on AIME 2024. The corner solution is therefore consistent with (7) across the full sensitivity bracket.
4.3. Test-time-compute curves (Snell et al. 2024; Brown et al. 2024)
The hard-subset regime in Snell et al. (2024) corresponds to far from 1 and in the 0.5–1.0 range. The 14× substitution result implies , exactly the threshold (7) in its form. The easy-subset regime corresponds to and , where the threshold flips.
Brown et al. (2024) report the same pattern in pass@k form on Llama and Pythia. On hard benchmarks (MiniF2F, MATH-hard subsets), the exponent (the local $\eta_\alpha^\rho$ in our notation) is large and the substitution holds; on easy benchmarks the exponent is small and the substitution breaks. The crossover occurs precisely where the elasticity ratio equals the inference-to-training dollar ratio, which is (7) with equality.
4.4. Negative case: commodity tiers
At $\alpha_0 > 0.95$ on routine-task workloads (short-form generation, retrieval, classification), $\eta_\alpha^\rho$ is bounded above by 0.05. The right-hand side of (7) is order unity. The threshold fails by an order of magnitude.
The prediction is that commodity tiers should not deploy explicit thinking budgets. They do not. The same prediction explains the absence of a continuous gradient of small-thinking-budget tiers between commodity and frontier.
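One way to see why the rollout elasticity is small on easy workloads: the accept rate cannot exceed 1, so the total log-lift available from rollouts is capped at the negative log of the base accept rate. The sketch below is a simple ceiling argument under that assumption, not a calibration; the function name and the rollout spans are illustrative.

```python
import math


def max_secant_elasticity(base_accept_rate, rho_lo, rho_hi):
    """Upper bound on the secant rollout elasticity when accept rate is capped at 1."""
    if not 0.0 < base_accept_rate <= 1.0:
        raise ValueError("base_accept_rate must lie in (0, 1]")
    if rho_hi <= rho_lo or rho_lo < 1:
        raise ValueError("need 1 <= rho_lo < rho_hi")
    # The largest possible log-lift moves the accept rate from its base value to 1.
    max_log_lift = -math.log(base_accept_rate)
    return max_log_lift / math.log(rho_hi / rho_lo)


# Base accept rate of 0.95, swept from a single rollout to 8 and to 64 rollouts.
bound_to_8 = max_secant_elasticity(0.95, 1, 8)
bound_to_64 = max_secant_elasticity(0.95, 1, 64)
```

Both bounds fall below 0.05 and shrink as the sweep widens, consistent with the commodity-tier row of the summary table.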
| Operating point | $\eta_\alpha^\rho$ (AIME 2024) | Threshold crossed? | Deployment fact |
|---|---|---|---|
| rStar-Math, $\rho = 64$ | 0.031 (secant) | No: $(\eta_\alpha^\rho - 1) < 0$ | Fixed $T$, accuracy-optimized |
| DeepSeek-R1, $\rho = 1$ | N/A ($\rho = 1$) | Consistent: corner $\rho = 1$ | Training-channel RL allocation |
| Snell et al. hard subsets | 0.5–1.0 | Yes: 14× substitution | Test-time compute dominant |
| Commodity tiers | $< 0.05$ | No: $\alpha_0 > 0.95$ | No rollout budget deployed |
5. Discussion
5.1. Capital allocation across the two channels
The threshold (7) gives a quantitative rule for where the next compute dollar should go. Frontier providers facing hard-reasoning workloads should mix, allocating to both channels along the boundary defined by equality in (7). Commodity providers facing easy-task workloads should allocate to the training channel only.
The observed market structure matches both predictions. The 2026 reasoning tier ships with thinking budgets that are themselves a tunable parameter: evidence that the provider sits on the boundary and lets the customer pick the operating point. Commodity tiers ship without rollout budgets at all: evidence that the provider sits well inside the training-dominated region.
5.2. The GPT-5.5 reprice as a falsifiable hypothesis
OpenAI raised GPT-5.5 prices by 100% over GPT-5.4 in April 2026. Under (1), a price increase at fixed requires a fall in , a rise in , or a rise in . The threshold theorem rationalizes the move only if the GPT-5.5 workload mix has shifted toward harder reasoning tasks where the inference-channel allocation share has risen. This is consistent with OpenAI’s published statement on GPT-5.5 thinking budgets.
The hypothesis is falsifiable: if a future GPT-5.5 disclosure shows flat or falling rollout share on a workload mix shifted toward easy tasks, a competing explanation is required.
5.3. Limitations
Separability assumption. If the verifier lift depends materially on training compute $T$ (the verifier and generator have not absorbed each other’s progress), the cross-partial does not vanish and (7) holds only locally. A broader calibration tracing non-separability is future work.
Fixed verifier construction cost. We have treated verifier construction cost as amortized over the verifier’s lifetime. If the verifier does not transfer across tasks, the fixed-cost approximation breaks and the threshold shifts toward training-channel allocation.
Four-point calibration. The calibration rests on four operating points. A population-level calibration with the full 2024–2026 reasoning-model release sequence would tighten the elasticity estimates.
6. Conclusion
Inference-time scaling and training compute are substitutes on hard reasoning tasks, but allocation is a different question from substitutability. We have derived a closed-form threshold under the Cost-correct decomposition that says when the next dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, calibrated the threshold against four operating points, and shown that the calibration matches the observed market split between frontier reasoning and commodity tiers.
The next paper in the sequence relaxes the separability assumption by treating verifier portability as the primary object of study.
Appendix A. Full proof of the threshold theorem
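Theorem (Threshold, general). At an interior operating point $(T, \rho)$ under separability (4) and cost ratio (6), the marginal dollar reduces $C_{\text{correct}}$ faster on the inference channel iff

$$\frac{\eta_\alpha^\rho - \partial \ln \bar\rho / \partial \ln \rho}{\eta_\alpha^T} > \frac{c_\rho\,\rho}{c_T\,T}. \tag{A.1}$$

Proof. Differentiating the logarithm of (1),

$$d \ln C_{\text{correct}} = \left( \frac{\partial \ln \bar\rho}{\partial \ln \rho} - \eta_\alpha^\rho \right) d \ln \rho - \eta_\alpha^T \, d \ln T. \tag{A.2}$$

The fractional change in $C_{\text{correct}}$ per dollar on the training channel is $-\eta_\alpha^T / (c_T T)$. The fractional change in $C_{\text{correct}}$ per dollar on the inference channel is $-(\eta_\alpha^\rho - \partial \ln \bar\rho / \partial \ln \rho) / (c_\rho \rho)$. Setting the inference rate of reduction strictly greater than the training rate and rearranging gives (A.1). The limit $\bar\rho \to \rho$ gives $\partial \ln \bar\rho / \partial \ln \rho \to 1$, recovering (7).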
Corollary (boundary curvature). The boundary surface where (A.1) holds with equality is concave in the rollout-dominant regime; the iso-cost-correct curves in the same plane are convex; the optimum lies at the unique tangent point.
Appendix B. Calibration tables
Table B.1. rStar-Math operating points. Source: Guan et al. (2025), Table 5.
| Model | Benchmark | $\rho$ | pass@1 | Notes |
|---|---|---|---|---|
| Qwen2.5-Math-7B base | AIME 2024 | 1 | 0.000 | base generator, no MCTS |
| Qwen2.5-Math-7B base | MATH-500 | 1 | 0.588 | base generator, no MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 8 | 0.500 | in-MCTS |
| rStar-Math (7B + 7B PRM) | MATH-500 | 8 | 0.894 | in-MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 64 | 0.533 | deployed |
| rStar-Math (7B + 7B PRM) | MATH-500 | 64 | 0.900 | deployed |
Table B.2. DeepSeek-R1 vs DeepSeek-V3 base at $\rho = 1$. Source: DeepSeek-AI (2025), Table 4.
| Model | Benchmark | $\rho$ | pass@1 | Notes |
|---|---|---|---|---|
| DeepSeek-V3 base | AIME 2024 | 1 | 0.392 | Pre-RL baseline |
| DeepSeek-R1-Zero | AIME 2024 | 1 | 0.710 | Pure RL, no SFT |
| DeepSeek-R1 | AIME 2024 | 1 | 0.798 | Post verifiable-reward RL |
| DeepSeek-V3 base | MATH-500 | 1 | 0.902 | Pre-RL baseline |
| DeepSeek-R1 | MATH-500 | 1 | 0.973 | Post verifiable-reward RL |
Table B.3. Snell et al. (2024) headline substitution result on PaLM-2-S MATH subsets.
| Subset | Substitution ratio (test-time / pre-training) | Threshold prediction |
|---|---|---|
| Hard MATH | 14× | Crosses threshold |
| Easy MATH | <1× | Does not cross |
Table B.4. Commodity-tier deployments (negative case). Source: Field Notes #1.
| Model | Workload | $\bar\rho$ deployed | $\alpha$ on workload |
|---|---|---|---|
| GPT-5.4 nano | Retrieval / short-form | 1 | >0.95 |
| Gemini Flash | Retrieval / short-form | 1 | >0.95 |
| Claude Haiku 4.5 | Retrieval / short-form | 1 | >0.95 |
References
- Bhardwaj, M. The Cost of Being Right. Verification Economics in 2026. Field Notes #2. ifitsmanu.com, 2026.
- Bhardwaj, M. The α Asymmetry. Why Verifiers Can Be Smaller Than Generators. Field Notes #3. ifitsmanu.com, 2026.
- Bhardwaj, M. The Inference Stack in 2026. Field Notes #1. ifitsmanu.com, 2026.
- Brown, B. et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv:2407.21787, 2024.
- Cobbe, K. et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
- Erol, U. et al. The Cost of Being Right: Evaluating Language Models by the Cost-of-Pass. ICLR 2026.
- Guan, X. et al. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519, 2025.
- Hoffmann, J. et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556, 2022.
- Kaplan, J. et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020.
- Liao, Y. et al. The Price of Progress: Tracking the Declining Cost of Computing, AI, and Other Transformative Technologies. arXiv:2511.23455, 2025.
- Lightman, H. et al. Let’s Verify Step by Step. arXiv:2305.20050, 2023.
- Snell, C. et al. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. arXiv:2408.03314, 2024.
- Stanford Human-Centered AI Institute. AI Index Report 2025. Stanford University, 2025.
Cite this article
@misc{bhardwaj2026inferencetimefrontier,
author = {Bhardwaj, Manu},
title = {The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation},
year = {2026},
month = {May},
url = {https://ifitsmanu.com/papers/inference-frontier},
howpublished = {\url{https://ifitsmanu.com/papers/inference-frontier/paper.pdf}},
note = {Working paper. Version 1.0.}
}
Bhardwaj, M. (2026, May). The inference-time compute frontier: A cost-correct threshold for training versus test-time allocation. ifitsmanu.com. https://ifitsmanu.com/papers/inference-frontier
Bhardwaj, Manu. "The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation." ifitsmanu.com, May 2026. https://ifitsmanu.com/papers/inference-frontier.
M. Bhardwaj, "The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation," ifitsmanu.com, May 2026. [Online]. Available: https://ifitsmanu.com/papers/inference-frontier