
The Inference-Time Compute Frontier.

A Cost-Correct Threshold for Training Versus Test-Time Allocation.

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the inference-economics wedge.


Companion to the verification-economics field notes. The Cost of Being Right (Field Notes #2) develops the Cost-correct decomposition. The α Asymmetry (Field Notes #3) shows that the verifier accept rate dominates the other cost levers. Verifier Procurement Under Unobservable Quality (Research Paper #1) closes the gap when the deployer must buy rather than build. This paper answers a different question: given that you are building, when should the next dollar go to more rollouts rather than more training?


Abstract

When does an additional dollar of compute reduce cost-per-correct-answer faster when spent on inference-time scaling than when spent on further training? Snell et al. (2024) and Brown et al. (2024) show that test-time compute can substitute for training compute on hard reasoning tasks, and Guan et al. (2025) show that verifier-guided rollouts let small models match flagship reasoners. What none of them give is an economic threshold that says where the substitution holds. We derive one. Under the Cost-correct decomposition of The Cost of Being Right, with verifier accept rate parameterized jointly in training compute $T$ and rollout count $\rho$, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff $(\eta_\alpha^\rho - 1)/\eta_\alpha^T$ exceeds the inference-to-training dollar ratio at the operating point. We calibrate the threshold against rStar-Math, DeepSeek-R1, and the published test-time-compute curves of Snell et al. (2024) and Brown et al. (2024), and show that the calibration matches the observed market split between frontier reasoning tiers and commodity tiers.


1. Introduction

Frontier reasoning models in 2025 ship with explicit thinking budgets. rStar-Math couples a 7B generator with a 7B process-reward verifier and Monte-Carlo Tree Search rollouts to beat o1-preview on AIME 2024 and MATH at a fraction of the inference dollar (Guan et al., 2025). DeepSeek-R1 lifts pass@1 on the same benchmarks through reinforcement learning with verifiable-reward signals at fixed rollout count (DeepSeek-AI, 2025). OpenAI’s o-series and the GPT-5.5 launch in April 2026 advertise per-query reasoning budgets as a first-class API parameter. Commodity tiers do not. GPT-5.4 nano, Gemini Flash, and Claude Haiku 4.5 ship without rollout budgets and serve a workload mix dominated by retrieval and short-form generation.

Two features of this split are striking. First, the split is sharp. There is no continuous gradient of “small thinking budget” tiers in the market; either a model deploys explicit inference-time scaling or it does not. Second, the split is recent. As late as 2024, even frontier providers shipped without dedicated rollout budgets, and the available economic frame was the Chinchilla compute-optimal training-data ratio of Hoffmann et al. (2022). The frame has since shifted to a question that the Chinchilla setup does not answer: where on the joint training-and-inference frontier should the next compute dollar go?

Snell et al. (2024) and Brown et al. (2024) answer the related question of substitutability but not the question of allocation. Snell et al. show on PaLM-2 that test-time compute can replace 14 times more pre-training compute on hard reasoning subsets. Brown et al. show on Llama-class models that pass@k under repeated sampling grows roughly log-linearly in the number of samples, following an exponentiated power law, and that the curve crosses the parameter-scaling curve at a benchmark-dependent crossover. Both papers fix the verifier and report accuracy-versus-compute curves. Neither casts the result as a cost-allocation problem with explicit verifier construction cost, and neither isolates the conditions under which substitution holds.

This paper supplies the missing economic threshold. The contribution is a closed-form condition under which the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, expressed in three observable parameters: the elasticity of verifier accept rate with respect to rollout count, the elasticity of accept rate with respect to training compute, and the inference-to-training dollar ratio at the operating point. The threshold derives from the Cost-correct decomposition and the verifier-dominance result of The α Asymmetry. It requires modeling the verifier accept rate as a joint function of both compute channels, taking partial derivatives in both, and identifying a closed-form switching condition.

We calibrate the threshold against four operating points. The threshold is crossed at the hard-difficulty subsets reported by Snell et al. and Brown et al.; it is not crossed at the easy subsets in the same papers, nor at the workload mixes implied by commodity-tier deployments. rStar-Math holds $T$ fixed at 7B and runs $\rho$ past the cost-correct optimum to chase headline accuracy on AIME 2024. DeepSeek-R1 sits at $\rho = 1$, where the threshold predicts the inference channel cannot clear the bar given the very high $\eta_\alpha^T$ the verifiable-reward RL stage realizes on the V3 base. The pattern matches the observed market split.


2. Related Work

Inference-time scaling. Snell et al. (2024) study optimal allocation of test-time compute across rollouts, revisions, and search depth on PaLM-2. Brown et al. (2024) study repeated sampling on Llama and Pythia across HumanEval, MATH, GSM8K, and MiniF2F. Both papers hold the verifier fixed and treat it as an oracle. Neither incorporates verifier construction cost or partitions a budget across the training and inference channels.

Cost-of-pass and cost-correct. Erol et al. (2026) introduce Cost-of-Pass as a per-accepted-correct-answer metric. The Cost of Being Right develops the multiplicative Cost-correct decomposition that separates cost-per-million-tokens, the reasoning multiplier, the rollout ratio, and the verifier accept rate. The α Asymmetry shows the partial derivative of Cost-correct with respect to $\alpha$ dominates the other partials in production regimes. None study allocation across training and inference channels.

Compute-optimal training. Kaplan et al. (2020) and Hoffmann et al. (2022) establish single-channel scaling laws. The Chinchilla frontier optimizes training compute at a single inference operating point. It does not extend to a regime in which the next dollar can be allocated to inference-time rollouts that lift verifier accept rate.

A separate body of work on outcome- and process-reward verifiers (Cobbe et al. (2021) introduced outcome-reward verifiers on GSM8K; Lightman et al. (2023) drew the explicit ORM-vs-PRM distinction and showed step-level process-reward signals dominate on MATH) and verifier-guided decoding (Guan et al., 2025) supplies the empirical content of the elasticity calibrations in Section 4.


3. Method

3.1. Cost-correct, restated

We work in the Cost-correct framework. The unit cost of a correct answer is

$$C \;=\; \frac{\mathrm{CPM}_{1:1} \cdot R \cdot (1 + \bar\rho)}{\alpha}, \qquad (1)$$

where $\mathrm{CPM}_{1:1}$ is the blended cost per million tokens at a unit input-to-output ratio, $R$ is the reasoning multiplier (output tokens per accepted answer), $\bar\rho$ is the average rollout ratio, and $\alpha \in (0, 1]$ is the verifier accept rate. The α-asymmetry result establishes that

$$\Big| \tfrac{\partial \log C}{\partial \log \alpha} \Big| \;=\; 1 \;\geq\; \Big| \tfrac{\partial \log C}{\partial \log x} \Big|, \qquad x \in \{\mathrm{CPM}_{1:1}, R, \bar\rho\}, \qquad (2)$$

with equality approached for $x = \bar\rho$ in the high-rollout limit $\bar\rho \to \infty$, where $\partial \log C / \partial \log \bar\rho = \bar\rho/(1+\bar\rho) \to 1$. This asymmetry makes verifier accept rate the natural pivot for a two-channel allocation rule.
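As a quick numerical check on (1) and the $\bar\rho$ elasticity quoted above, the sketch below evaluates Cost-correct at a hypothetical operating point and estimates log-elasticities by central differences. The function names and all numbers are illustrative assumptions, not values from the paper.

```python
import math


def cost_correct(cpm, r, rho_bar, alpha):
    """Unit cost of a correct answer, equation (1)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return cpm * r * (1.0 + rho_bar) / alpha


def log_elasticity(f, x, rel_step=1e-6):
    """Central-difference estimate of d log f / d log x at x."""
    up, down = x * (1.0 + rel_step), x * (1.0 - rel_step)
    return (math.log(f(up)) - math.log(f(down))) / (math.log(up) - math.log(down))


# Hypothetical operating point (assumed for illustration only).
cpm, r, rho_bar, alpha = 2.0, 5.0, 8.0, 0.6

# Elasticity of C with respect to the accept rate: -1 at every operating point.
e_alpha = log_elasticity(lambda a: cost_correct(cpm, r, rho_bar, a), alpha)

# Elasticity of C with respect to the rollout ratio: rho_bar / (1 + rho_bar).
e_rho = log_elasticity(lambda p: cost_correct(cpm, r, p, alpha), rho_bar)

print(f"dlogC/dlog(alpha)   = {e_alpha:.4f}")
print(f"dlogC/dlog(rho_bar) = {e_rho:.4f}  (closed form {rho_bar / (1 + rho_bar):.4f})")
```

At this assumed point the rollout elasticity sits below one in magnitude and moves toward one as `rho_bar` grows, which is the limit described in the text.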

3.2. Two-channel parameterization

Let $T$ denote post-training compute spent on the generator (in FLOP-units) and $\rho$ denote rollout count per query. We parameterize the verifier accept rate as

$$\alpha(T, \rho) \;=\; g\bigl(\alpha_0(T),\, h(\rho)\bigr), \qquad (3)$$

where $\alpha_0(T)$ is the base accept rate of an unfiltered single rollout and $h(\rho)$ is the verifier lift from selecting the best of $\rho$ rollouts under a fixed verifier. We adopt the separability assumption

$$\log \alpha(T, \rho) \;=\; \log \alpha_0(T) + h(\rho), \qquad (4)$$

and define the elasticities

$$\eta_\alpha^T \;\equiv\; \frac{\partial \log \alpha}{\partial \log T}, \qquad \eta_\alpha^\rho \;\equiv\; \frac{\partial \log \alpha}{\partial \log \rho}. \qquad (5)$$

Under (4) the cross-partial $\partial^2 \log \alpha / \partial \log T \, \partial \log \rho$ vanishes. Separability is justified empirically when verifier-guided selection acts on a fixed generator distribution that has already absorbed the post-training lift, as in best-of-N reranking with a frozen process-reward model.
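To make the separability assumption concrete, the sketch below picks one hypothetical separable form, a power law for $\alpha_0(T)$ and a log-linear lift $h(\rho)$, and estimates the elasticities in (5) numerically. The functional forms and constants are assumptions chosen for illustration; the paper does not commit to them.

```python
import math


def log_alpha(T, rho, a=0.02, b=0.15, c=0.30):
    """Hypothetical separable accept rate: log alpha = log alpha_0(T) + h(rho).

    alpha_0(T) = a * T**b and h(rho) = c * log(rho) are assumed forms.
    """
    return math.log(a * T**b) + c * math.log(rho)


def elasticity(log_fn, x, rel_step=1e-4):
    """Central-difference estimate of d(log_fn) / d log x at x."""
    up, down = x * (1.0 + rel_step), x * (1.0 - rel_step)
    return (log_fn(up) - log_fn(down)) / (math.log(up) - math.log(down))


# Training-channel elasticity at a fixed rollout count.
eta_T = elasticity(lambda T: log_alpha(T, 4.0), 100.0)

# Rollout-channel elasticity at two different training budgets.
eta_rho_low_T = elasticity(lambda p: log_alpha(100.0, p), 4.0)
eta_rho_high_T = elasticity(lambda p: log_alpha(1000.0, p), 4.0)

print(eta_T, eta_rho_low_T, eta_rho_high_T)
```

Under this assumed form the rollout elasticity is the same at both training budgets, which is what a vanishing cross-partial means in practice. A form in which the lift depended on $T$ would show the two estimates diverging.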

3.3. Cost ratio and budget constraint

Let $c_T$ denote the marginal cost of one unit of post-training FLOP, amortized over the expected query lifetime $Q$, and $c_I$ the marginal cost of one unit of inference FLOP per query. Define

$$\nu \;\equiv\; \frac{c_T}{c_I}. \qquad (6)$$

Under public price points and the price-of-progress dataset of Liao et al. (2025), $\nu$ at the frontier operating point in 2026 is on the order of $10^{-5}$ to $10^{-4}$ per query when amortized over a generator’s commercial lifetime. The operationally relevant quantity is the training-to-inference dollar ratio $\mu \equiv (T \cdot c_T) / (\rho \cdot c_I)$ at the operating point.

3.4. The threshold theorem

The inference channel clears the threshold when the rollout-net-of-cost elasticity ratio exceeds the inference-to-training dollar ratio at the operating point.

We state the result in the rollout-dominant regime where $\rho \gg 1$ so that $(1 + \rho) \approx \rho$. The general statement appears in Appendix A.

Theorem (Threshold). At an interior operating point $(T, \rho)$ with $\rho \gg 1$ under separability (4) and the cost ratio (6), the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel iff

$$\frac{\eta_\alpha^\rho \;-\; 1}{\eta_\alpha^T} \;>\; \frac{\rho \cdot c_I}{T \cdot c_T} \;=\; \frac{1}{\mu}. \qquad (7)$$

Proof. Take logs of (1). The fractional reduction in $C$ from a 1% increase in $T$ is $0.01 \cdot \eta_\alpha^T$, at a dollar cost of $0.01 \cdot T \cdot c_T$. The fractional reduction in $C$ from a 1% increase in $\rho$ is $0.01 \cdot (\eta_\alpha^\rho - 1)$, at a dollar cost of $0.01 \cdot \rho \cdot c_I$. The factor $0.01$ cancels in the per-dollar log-reductions:

$$g_T \;=\; \frac{\eta_\alpha^T}{T \cdot c_T}, \qquad g_\rho \;=\; \frac{\eta_\alpha^\rho - 1}{\rho \cdot c_I}. \qquad (10)$$

The inference channel dominates iff $g_\rho > g_T$. Cross-multiplying, with $\eta_\alpha^T > 0$, gives (7). $\square$

The theorem partitions the $(T, \rho)$ plane into a training-dominated region and an inference-dominated region. The optimum lies on the boundary, where (7) holds with equality. The right-hand side is observable from the deployment cost ledger. The left-hand side is the rollout-net-of-cost elasticity ratio: it credits rollouts only for the lift in $\alpha$ above the rollout cost elasticity $\rho/(1+\rho)$, which in the rollout-dominant regime is unity.
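The switching condition can be sketched as a small decision function. This is a minimal illustration of (7) and (10) in the rollout-dominant regime; the function names and the operating point are hypothetical, and the sketch assumes $\eta_\alpha^T > 0$.

```python
def dollar_ratio(T, c_T, rho, c_I):
    """Training-to-inference dollar ratio mu = (T * c_T) / (rho * c_I)."""
    return (T * c_T) / (rho * c_I)


def per_dollar_gains(eta_T, eta_rho, T, c_T, rho, c_I):
    """Per-dollar log-reductions (g_T, g_rho) from equation (10)."""
    g_T = eta_T / (T * c_T)
    g_rho = (eta_rho - 1.0) / (rho * c_I)
    return g_T, g_rho


def inference_dominates(eta_T, eta_rho, T, c_T, rho, c_I):
    """Condition (7): (eta_rho - 1) / eta_T > 1 / mu.

    Requires eta_T > 0 so that cross-multiplying preserves the inequality.
    """
    if eta_T <= 0:
        raise ValueError("eta_T must be positive")
    return (eta_rho - 1.0) / eta_T > 1.0 / dollar_ratio(T, c_T, rho, c_I)


# Hypothetical operating point: every number here is an assumption.
point = dict(eta_T=0.5, eta_rho=1.6, T=1000.0, c_T=0.1, rho=50.0, c_I=1.0)

mu = dollar_ratio(point["T"], point["c_T"], point["rho"], point["c_I"])
g_T, g_rho = per_dollar_gains(**point)
crosses = inference_dominates(**point)

print(f"mu = {mu:.2f}, g_T = {g_T:.4f}, g_rho = {g_rho:.4f}, crosses = {crosses}")
```

The ratio test and the direct comparison `g_rho > g_T` agree whenever `eta_T` is positive. Any rollout elasticity below one makes `g_rho` negative, so the test fails regardless of the dollar ratio.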

3.5. Comparative statics

Three corollaries follow directly from (7).

Corollary 1 (frontier ceiling). As $\alpha_0 \to 1$ at fixed verifier, $\eta_\alpha^T \to 0$. The right-hand side of (7) is bounded; the left-hand side grows without bound in magnitude, taking the sign of $\eta_\alpha^\rho - 1$. Frontier-difficulty subsets satisfy the threshold; easy subsets do not.

Corollary 2 (reasoning multiplier). Tasks with high $R$ magnify the absolute dollar return to either channel. Combined with the α-asymmetry result, reasoning-heavy workloads favor inference-time allocation; retrieval-heavy workloads do not.

Corollary 3 (amortization). When $Q$ is large, the amortized $c_T$ falls, so $\nu$ and $\mu$ fall and the right-hand side $1/\mu$ of (7) rises: the inference channel must clear a higher bar to dominate. This predicts that high-throughput commodity tiers serving long-lived workloads do not deploy thinking budgets, because the cost-per-correct-answer reduction from rollouts on easy tasks is too small to clear the bar.


4. Experiments

This section calibrates the threshold (7) against four operating points. All numbers are cited from primary sources; we report no new measurements.

4.1. rStar-Math (Microsoft Research, January 2025)

Guan et al. (2025) report a Qwen2.5-Math-7B generator paired with a 7B process-reward verifier and MCTS rollouts. The deployed configuration runs $\rho = 64$ rollouts per query, reporting pass@1 of $0.533$ on AIME 2024 and $0.900$ on MATH-500.

The secant elasticity over the in-MCTS sweep $\rho = 8 \to 64$ on AIME 2024 is $\log(0.533/0.500)/\log(64/8) \approx 0.031$; on MATH-500 it is $\log(0.900/0.894)/\log(64/8) \approx 0.003$.

Substituting into (7): $(\eta_\alpha^\rho - 1)/\eta_\alpha^T = (0.031 - 1)/\eta_\alpha^T \approx -0.97/\eta_\alpha^T < 0$ for any positive $\eta_\alpha^T$. The inference channel does not clear the threshold at the deployed $\rho = 64$. rStar-Math optimized headline accuracy at fixed model scale, not cost-per-correct-answer; the deployed configuration sits inside the verifier-ceiling regime (Corollary 1). A cost-conscious redeployment would run at materially lower $\rho$, trading accuracy for cost-per-correct-answer reduction.
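The secant elasticities can be reproduced from the four pass@1 figures quoted above. The helper name below is an assumption; the inputs are the values cited from Guan et al. (2025).

```python
import math


def secant_elasticity(y_lo, y_hi, x_lo, x_hi):
    """Secant estimate of d log y / d log x between two operating points."""
    return math.log(y_hi / y_lo) / math.log(x_hi / x_lo)


# pass@1 at rho = 8 and rho = 64, as cited in the text.
eta_rho_aime = secant_elasticity(0.500, 0.533, 8, 64)
eta_rho_math500 = secant_elasticity(0.894, 0.900, 8, 64)

# Numerator of the left-hand side of (7) on AIME 2024.
numerator_aime = eta_rho_aime - 1.0

print(f"AIME 2024: eta_rho = {eta_rho_aime:.3f}, eta_rho - 1 = {numerator_aime:.2f}")
print(f"MATH-500:  eta_rho = {eta_rho_math500:.3f}")
```

Because the numerator is negative on both benchmarks, the sign of the left-hand side of (7) is settled without knowing $\eta_\alpha^T$, provided it is positive.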

4.2. DeepSeek-R1 (DeepSeek-AI, January 2025)

DeepSeek-AI (2025) lift pass@1 on AIME 2024 from $0.392$ (DeepSeek-V3 base) to $0.798$ (DeepSeek-R1) through RL with verifiable-reward signals at fixed rollout count ($\rho = 1$).

DeepSeek does not disclose RL post-training compute as a fraction of V3 pre-training. Under a sensitivity bracket $s = \Delta T / T_{V3} \in [0.01, 0.10]$, the implied training-channel elasticity on AIME 2024 is $\log(0.798/0.392)/\log(1+s) \in [7.5, 71]$.

For the inference channel to clear (7) at R1 would require $(\eta_\alpha^\rho - 1)/\eta_\alpha^T > 1/\mu$, meaning $\eta_\alpha^\rho \gtrsim 8$ to $71$ when $1/\mu$ is of order unity, implausible for any published verifier on AIME 2024. The corner solution $\rho = 1$ is therefore consistent with (7) across the full sensitivity bracket.
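The sensitivity bracket can be checked directly. The sketch below recomputes the implied training-channel elasticity at both ends of the assumed range for $s$; the function name is hypothetical and the pass@1 values are those cited above.

```python
import math

# Log lift in pass@1 on AIME 2024 from DeepSeek-V3 base to DeepSeek-R1.
log_lift = math.log(0.798 / 0.392)


def training_elasticity(s):
    """Implied eta_T when RL compute is a fraction s of V3 pre-training compute."""
    return log_lift / math.log(1.0 + s)


eta_T_high = training_elasticity(0.01)  # small RL budget implies a large elasticity
eta_T_low = training_elasticity(0.10)   # larger RL budget implies a smaller one

print(f"eta_T in [{eta_T_low:.1f}, {eta_T_high:.0f}]")
```

The bracket is wide because the elasticity scales roughly as $1/s$ for small $s$, so the conclusion is stated across the whole range instead of at a point estimate.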

4.3. Test-time-compute curves (Snell et al. 2024; Brown et al. 2024)

The hard-subset regime in Snell et al. (2024) corresponds to $\alpha_0$ far from 1 and $\eta_\alpha^\rho$ in the 0.5–1.0 range. The 14× substitution result implies $\eta_\alpha^\rho \cdot \mu \gg \eta_\alpha^T$, exactly the threshold (7) in its $\eta_\alpha^\rho \gg 1$ form. The easy-subset regime corresponds to $\alpha_0 \to 1$ and $\eta_\alpha^\rho \to 0$, where the threshold flips.

Brown et al. (2024) report the same pattern in pass@k form on Llama and Pythia. On hard benchmarks (MiniF2F, MATH-hard subsets), the exponent (the local $\eta_\alpha^\rho$ in our notation) is large and the substitution holds; on easy benchmarks the exponent is small and the substitution breaks. The crossover occurs precisely where $(\eta_\alpha^\rho - 1)/\eta_\alpha^T = 1/\mu$, which is (7) with equality.

4.4. Negative case: commodity tiers

At $\alpha_0 > 0.95$ on routine-task workloads (short-form generation, retrieval, classification), $\eta_\alpha^\rho$ is bounded above by $1 - \alpha_0 < 0.05$. The right-hand side of (7) is order unity. The threshold fails by an order of magnitude.

The prediction is that commodity tiers should not deploy explicit thinking budgets. They do not. The same prediction explains the absence of a continuous gradient of small-thinking-budget tiers between commodity and frontier.

Table 1. Threshold (7) calibration across four operating points. The threshold is crossed in the hard-reasoning regime and missed in all other cases, matching observed deployment choices.

| Operating point | $\eta_\alpha^\rho$ (AIME 2024) | Threshold crossed? | Deployment fact |
| --- | --- | --- | --- |
| rStar-Math, $\rho = 64$ | 0.031 (secant) | No: $(\eta_\alpha^\rho - 1) < 0$ | Fixed $T$, accuracy-optimized |
| DeepSeek-R1, $\rho = 1$ | N/A ($\rho = 1$) | Consistent: corner $\rho = 1$ | Training-channel RL allocation |
| Snell et al. hard subsets | 0.5–1.0 | Yes: 14× substitution | Test-time compute dominant |
| Commodity tiers | $< 0.05$ | No: $\alpha_0 > 0.95$ | No rollout budget deployed |

5. Discussion

5.1. Capital allocation across the two channels

The threshold (7) gives a quantitative rule for where the next compute dollar should go. Frontier providers facing hard-reasoning workloads should mix, allocating to both channels along the boundary defined by equality in (7). Commodity providers facing easy-task workloads should allocate to the training channel only.

The observed market structure matches both predictions. The 2026 reasoning tier ships with thinking budgets that are themselves a tunable parameter: evidence that the provider sits on the boundary and lets the customer pick the operating point. Commodity tiers ship without rollout budgets at all: evidence that the provider sits well inside the training-dominated region.

5.2. The GPT-5.5 reprice as a falsifiable hypothesis

OpenAI raised GPT-5.5 prices by 100% over GPT-5.4 in April 2026. Under (1), a price increase at fixed $\mathrm{CPM}_{1:1}$ requires a fall in $\alpha$, a rise in $R$, or a rise in $\bar\rho$. The threshold theorem rationalizes the move only if the GPT-5.5 workload mix has shifted toward harder reasoning tasks where the inference-channel allocation share has risen. This is consistent with OpenAI’s published statement on GPT-5.5 thinking budgets.

The hypothesis is falsifiable: if a future GPT-5.5 disclosure shows flat or falling rollout share on a workload mix shifted toward easy tasks, a competing explanation is required.

5.3. Limitations

Separability assumption. If $\eta_\alpha^\rho$ depends materially on $T$ (the verifier and generator have not absorbed each other’s progress), the cross-partial does not vanish and (7) holds only locally. A broader calibration tracing non-separability is future work.

Fixed verifier construction cost. We have treated verifier construction cost as amortized over the verifier’s lifetime. If the verifier does not transfer across tasks, the fixed-cost approximation breaks and the threshold shifts toward training-channel allocation.

Four-point calibration. The calibration rests on four operating points. A population-level calibration with the full 2024–2026 reasoning-model release sequence would tighten the elasticity estimates.


6. Conclusion

Inference-time scaling and training compute are substitutes on hard reasoning tasks, but allocation is a different question from substitutability. We have derived a closed-form threshold under the Cost-correct decomposition that says when the next dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, calibrated the threshold against four operating points, and shown that the calibration matches the observed market split between frontier reasoning and commodity tiers.

The next paper in the sequence relaxes the separability assumption by treating verifier portability as the primary object of study.


Appendix A. Full proof of the threshold theorem

Theorem (Threshold, general). At an interior operating point $(T, \rho)$ under separability (4) and cost ratio (6), the marginal dollar reduces $C$ faster on the inference channel iff

$$\frac{\eta_\alpha^\rho \;-\; \tfrac{\rho}{1+\rho}}{\eta_\alpha^T} \;>\; \frac{\rho \cdot c_I}{T \cdot c_T}. \qquad (\mathrm{A.1})$$

Proof. Differentiating the logarithm of (1),

$$\frac{\partial \log C}{\partial T} = -\frac{\eta_\alpha^T}{T}, \qquad \frac{\partial \log C}{\partial \rho} = \frac{1}{1+\rho} - \frac{\eta_\alpha^\rho}{\rho}. \qquad (\mathrm{A.2})$$

The fractional reduction in $C$ per dollar on the training channel is $\eta_\alpha^T / (T \cdot c_T)$. The fractional reduction in $C$ per dollar on the inference channel is $(\eta_\alpha^\rho/\rho - 1/(1+\rho)) / c_I$. Setting the inference rate strictly greater than the training rate and rearranging gives (A.1). The $\rho \gg 1$ limit gives $\rho/(1+\rho) \to 1$, recovering (7). $\square$
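The rearrangement step in the proof can be written out explicitly. The lines below are a worked version of that step, assuming $\eta_\alpha^T > 0$ and positive $\rho$ and $c_I$ so that each operation preserves the direction of the inequality.

```latex
\begin{align*}
\frac{1}{c_I}\left(\frac{\eta_\alpha^\rho}{\rho} - \frac{1}{1+\rho}\right)
  &> \frac{\eta_\alpha^T}{T \cdot c_T}
  && \text{(inference rate exceeds training rate)} \\
\eta_\alpha^\rho - \frac{\rho}{1+\rho}
  &> \frac{\rho \cdot c_I}{T \cdot c_T}\,\eta_\alpha^T
  && \text{(multiply both sides by } \rho \cdot c_I > 0\text{)} \\
\frac{\eta_\alpha^\rho - \tfrac{\rho}{1+\rho}}{\eta_\alpha^T}
  &> \frac{\rho \cdot c_I}{T \cdot c_T}
  && \text{(divide both sides by } \eta_\alpha^T > 0\text{)}
\end{align*}
```

The last line is (A.1). If $\eta_\alpha^T$ were negative, the final division would reverse the inequality, which is why the positivity assumption is stated.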

Corollary (boundary curvature). The boundary surface where (A.1) holds with equality is concave in the rollout-dominant regime; the iso-cost-correct curves in the same plane are convex; the optimum lies at the unique tangent point.


Appendix B. Calibration tables

Table B.1. rStar-Math operating points. Source: Guan et al. (2025), Table 5.

| Model | Benchmark | $\rho$ | pass@1 | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-7B base | AIME 2024 | 1 | 0.000 | base generator, no MCTS |
| Qwen2.5-Math-7B base | MATH-500 | 1 | 0.588 | base generator, no MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 8 | 0.500 | in-MCTS |
| rStar-Math (7B + 7B PRM) | MATH-500 | 8 | 0.894 | in-MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 64 | 0.533 | deployed |
| rStar-Math (7B + 7B PRM) | MATH-500 | 64 | 0.900 | deployed |

Table B.2. DeepSeek-R1 vs DeepSeek-V3 base at $\rho = 1$. Source: DeepSeek-AI (2025), Table 4.

| Model | Benchmark | $\rho$ | pass@1 | Notes |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 base | AIME 2024 | 1 | 0.392 | Pre-RL baseline |
| DeepSeek-R1-Zero | AIME 2024 | 1 | 0.710 | Pure RL, no SFT |
| DeepSeek-R1 | AIME 2024 | 1 | 0.798 | Post verifiable-reward RL |
| DeepSeek-V3 base | MATH-500 | 1 | 0.902 | Pre-RL baseline |
| DeepSeek-R1 | MATH-500 | 1 | 0.973 | Post verifiable-reward RL |

Table B.3. Snell et al. (2024) headline substitution result on PaLM-2-S MATH subsets.

| Subset | Substitution ratio (test-time / pre-training) | Threshold prediction |
| --- | --- | --- |
| Hard MATH | 14× | Crosses threshold |
| Easy MATH | <1× | Does not cross |

Table B.4. Commodity-tier deployments (negative case). Source: Field Notes #1.

| Model | Workload | $\bar\rho$ deployed | $\alpha$ on workload |
| --- | --- | --- | --- |
| GPT-5.4 nano | Retrieval / short-form | 1 | >0.95 |
| Gemini Flash | Retrieval / short-form | 1 | >0.95 |
| Claude Haiku 4.5 | Retrieval / short-form | 1 | >0.95 |

References

  1. Bhardwaj, M. The Cost of Being Right. Verification Economics in 2026. Field Notes #2. ifitsmanu.com, 2026.
  2. Bhardwaj, M. The α Asymmetry. Why Verifiers Can Be Smaller Than Generators. Field Notes #3. ifitsmanu.com, 2026.
  3. Bhardwaj, M. The Inference Stack in 2026. Field Notes #1. ifitsmanu.com, 2026.
  4. Brown, B. et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv:2407.21787, 2024.
  5. Cobbe, K. et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
  6. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
  7. Erol, U. et al. The Cost of Being Right: Evaluating Language Models by the Cost-of-Pass. ICLR 2026.
  8. Guan, X. et al. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519, 2025.
  9. Hoffmann, J. et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556, 2022.
  10. Kaplan, J. et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020.
  11. Liao, Y. et al. The Price of Progress: Tracking the Declining Cost of Computing, AI, and Other Transformative Technologies. arXiv:2511.23455, 2025.
  12. Lightman, H. et al. Let’s Verify Step by Step. arXiv:2305.20050, 2023.
  13. Snell, C. et al. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. arXiv:2408.03314, 2024.
  14. Stanford Human-Centered AI Institute. AI Index Report 2025. Stanford University, 2025.

Cite this article

@misc{bhardwaj2026inferencetimefrontier,
  author       = {Bhardwaj, Manu},
  title        = {The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation},
  year         = {2026},
  month        = {May},
  url          = {https://ifitsmanu.com/papers/inference-frontier},
  howpublished = {\url{https://ifitsmanu.com/papers/inference-frontier/paper.pdf}},
  note         = {Working paper. Version 1.0.}
}
