The Inference-Time Compute Frontier. A Cost-Correct Threshold for Training Versus Test-Time Allocation.

Q: What is the threshold theorem?

At an interior operating point with many rollouts, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff (η_α^ρ − 1)/η_α^T > (ρ·c_I)/(T·c_T). The left side is the rollout-net-of-cost elasticity ratio; the right side is the inference-to-training dollar ratio at the operating point, observable from the deployment cost ledger.

Q: Why does rStar-Math run past the cost-correct optimum?

rStar-Math optimized pass@1 on AIME 2024 at fixed model scale, not cost-per-correct-answer. At ρ=64 the secant rollout elasticity is 0.031, making (η_α^ρ − 1) negative. A cost-conscious redeployment would run at materially lower ρ, trading accuracy for cost-per-correct-answer reduction. The deployed configuration is rationalized by Corollary 1 (frontier ceiling at fixed T), not by (7) with T free.

Q: Why does the commodity tier not deploy thinking budgets?

At α₀ > 0.95 on routine workloads, η_α^ρ is bounded above by 1 − α₀ < 0.05. The right-hand side of (7) is order unity. The threshold fails by an order of magnitude, predicting no rollout deployment. Commodity tiers (GPT-5.4 nano, Gemini Flash, Claude Haiku 4.5) confirm this prediction.

Q: What is the separability assumption and when does it break?

Separability (log α = log α₀(T) + h(ρ)) means a 1% increase in training compute and a 1% increase in rollouts contribute additively to log α. The cross-partial ∂²log α / ∂log T ∂log ρ vanishes. It is justified empirically when verifier-guided selection acts on a frozen generator distribution. It breaks when the verifier and generator have not fully absorbed each other's progress; in that regime (7) holds only locally.

Manu Bhardwaj

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the inference-economics wedge.

Companion to the verification-economics field notes. The Cost of Being Right (Field Notes #2) develops the Cost-correct decomposition. The α Asymmetry (Field Notes #3) shows that verifier accept rate dominates the other cost levers. Verifier Procurement Under Unobservable Quality (Research Paper #1) closes the gap when the deployer must buy rather than build. This paper answers a different question: given that you are building, when should the next dollar go to more rollouts rather than more training?

Abstract

When does an additional dollar of compute reduce cost-per-correct-answer faster when spent on inference-time scaling than when spent on further training? Snell et al. (2024) and Brown et al. (2024) show that test-time compute can substitute for training compute on hard reasoning tasks, and Guan et al. (2025) show that verifier-guided rollouts let small models match flagship reasoners. What none of them give is an economic threshold that says where the substitution holds. We derive one. Under the Cost-correct decomposition of The Cost of Being Right, with verifier accept rate parameterized jointly in training compute $T$ and rollout count $\rho$, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff $(\eta_\alpha^\rho - 1)/\eta_\alpha^T$ exceeds the inference-to-training dollar ratio at the operating point. We calibrate the threshold against rStar-Math, DeepSeek-R1, and the published test-time-compute curves of Snell et al. (2024) and Brown et al. (2024), and show that the calibration matches the observed market split between frontier reasoning tiers and commodity tiers.

1. Introduction

Frontier reasoning models in 2025 ship with explicit thinking budgets. rStar-Math couples a 7B generator with a 7B process-reward verifier and Monte-Carlo Tree Search rollouts to beat o1-preview on AIME 2024 and MATH at a fraction of the inference dollar (Guan et al., 2025). DeepSeek-R1 lifts pass@1 on the same benchmarks through reinforcement learning with verifiable-reward signals at fixed rollout count (DeepSeek-AI, 2025). OpenAI’s o-series and the GPT-5.5 launch in April 2026 advertise per-query reasoning budgets as a first-class API parameter. Commodity tiers do not. GPT-5.4 nano, Gemini Flash, and Claude Haiku 4.5 ship without rollout budgets and serve a workload mix dominated by retrieval and short-form generation.

Two features of this split are striking. First, the split is sharp. There is no continuous gradient of “small thinking budget” tiers in the market; either a model deploys explicit inference-time scaling or it does not. Second, the split is recent. As late as 2024, even frontier providers shipped without dedicated rollout budgets, and the available economic frame was the Chinchilla compute-optimal training-data ratio of Hoffmann et al. (2022). The frame has since shifted to a question that the Chinchilla setup does not answer: where on the joint training-and-inference frontier should the next compute dollar go?

Snell et al. (2024) and Brown et al. (2024) answer the related question of substitutability but not the question of allocation. Snell et al. show on PaLM-2 that test-time compute can replace 14 times more pre-training compute on hard reasoning subsets. Brown et al. show on Llama-class models that pass@k under repeated sampling scales as an exponential decay in compute and that the curve crosses the parameter-scaling curve at a benchmark-dependent crossover. Both papers fix the verifier and report accuracy versus compute curves. Neither casts the result as a cost-allocation problem with explicit verifier construction cost, and neither isolates the conditions under which substitution holds.

This paper supplies the missing economic threshold. The contribution is a closed-form condition under which the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, expressed in three observable parameters: the elasticity of verifier accept rate with respect to rollout count, the elasticity of accept rate with respect to training compute, and the inference-to-training dollar ratio at the operating point. The threshold derives from the Cost-correct decomposition and the verifier-dominance result of The α Asymmetry. It requires modeling the verifier accept rate as a joint function of both compute channels, taking partial derivatives in both, and identifying a closed-form switching condition.

We calibrate the threshold against four operating points. The threshold is crossed at the hard-difficulty subsets reported by Snell et al. and Brown et al.; it is not crossed at the easy subsets in the same papers, nor at the workload mixes implied by commodity-tier deployments. rStar-Math holds $T$ fixed at 7B and runs $\rho$ past the cost-correct optimum to chase headline accuracy on AIME 2024. DeepSeek-R1 sits at $\rho = 1$ where the threshold predicts the inference channel cannot clear the bar given the very high $\eta_\alpha^T$ the verifiable-reward RL stage realizes on the V3 base. The pattern matches the observed market split.

2. Related work

Inference-time scaling. Snell et al. (2024) study optimal allocation of test-time compute across rollouts, revisions, and search depth on PaLM-2. Brown et al. (2024) study repeated sampling on Llama and Pythia across HumanEval, MATH, GSM8K, and MiniF2F. Both papers hold the verifier fixed and treat it as an oracle. Neither incorporates verifier construction cost or partitions a budget across the training and inference channels.

Cost-of-pass and cost-correct. Erol et al. (2026) introduce Cost-of-Pass as a per-accepted-correct-answer metric. The Cost of Being Right develops the multiplicative Cost-correct decomposition that separates cost-per-million-tokens, the reasoning multiplier, the rollout ratio, and the verifier accept rate. The α Asymmetry shows the partial derivative of Cost-correct with respect to $\alpha$ dominates the other partials in production regimes. None study allocation across training and inference channels.

Compute-optimal training. Kaplan et al. (2020) and Hoffmann et al. (2022) establish single-channel scaling laws. The Chinchilla frontier optimizes training compute at a single inference operating point. It does not extend to a regime in which the next dollar can be allocated to inference-time rollouts that lift verifier accept rate.

A separate body of work on outcome- and process-reward verifiers and on verifier-guided decoding (Guan et al., 2025) supplies the empirical content of the elasticity calibrations in Section 4. Cobbe et al. (2021) introduced outcome-reward verifiers on GSM8K; Lightman et al. (2023) drew the explicit ORM-vs-PRM distinction and showed that step-level process-reward signals dominate on MATH.

3. Method

3.1. Cost-correct, restated

We work in the Cost-correct framework. The unit cost of a correct answer is

$$C \;=\; \frac{\mathrm{CPM}_{1:1} \cdot R \cdot (1 + \bar\rho)}{\alpha}, \tag{1}$$

where $\mathrm{CPM}_{1:1}$ is the blended cost per million tokens at a unit input-to-output ratio, $R$ is the reasoning multiplier (output tokens per accepted answer), $\bar\rho$ is the average rollout ratio, and $\alpha \in (0, 1]$ is the verifier accept rate. The α-asymmetry result establishes that

$$\Big| \tfrac{\partial \log C}{\partial \log \alpha} \Big| \;=\; 1 \;\geq\; \Big| \tfrac{\partial \log C}{\partial \log x} \Big|, \qquad x \in \{\mathrm{CPM}_{1:1}, R, \bar\rho\}, \tag{2}$$

with equality holding identically for $\mathrm{CPM}_{1:1}$ and $R$, which enter (1) multiplicatively, and approached for $\bar\rho$ in the high-rollout limit $\bar\rho \to \infty$, where $\partial \log C / \partial \log \bar\rho = \bar\rho/(1+\bar\rho) \to 1$. This asymmetry makes verifier accept rate the natural pivot for a two-channel allocation rule.
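A minimal numerical sketch of (1) and (2) may help fix the magnitudes. The function names and the operating point below are hypothetical, chosen only for illustration; the elasticities are estimated by central differences in log space rather than taken from any published calibration.

```python
import math


def cost_correct(cpm: float, reasoning_mult: float, rollout_ratio: float, alpha: float) -> float:
    """Unit cost of a correct answer, following equation (1)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return cpm * reasoning_mult * (1.0 + rollout_ratio) / alpha


def log_elasticity(f, x: float, rel_step: float = 1e-6) -> float:
    """Central-difference estimate of d log f / d log x at x."""
    up, down = x * (1.0 + rel_step), x * (1.0 - rel_step)
    return (math.log(f(up)) - math.log(f(down))) / (math.log(up) - math.log(down))


# Hypothetical operating point (assumption, not a measured deployment).
CPM, R, RHO_BAR, ALPHA = 2.0, 5.0, 3.0, 0.5

e_alpha = log_elasticity(lambda a: cost_correct(CPM, R, RHO_BAR, a), ALPHA)
e_rho = log_elasticity(lambda r: cost_correct(CPM, R, r, ALPHA), RHO_BAR)
e_cpm = log_elasticity(lambda c: cost_correct(c, R, RHO_BAR, ALPHA), CPM)

print(f"alpha: {e_alpha:+.3f}  rho_bar: {e_rho:+.3f}  CPM: {e_cpm:+.3f}")
```

At $\bar\rho = 3$ the rollout elasticity is $\bar\rho/(1+\bar\rho) = 0.75$, strictly below the unit magnitude of the $\alpha$ elasticity, and it approaches 1 only as $\bar\rho$ grows.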

3.2. Two-channel parameterization

Let $T$ denote post-training compute spent on the generator (in FLOP-units) and $\rho$ denote rollout count per query. We parameterize the verifier accept rate as

$$\alpha(T, \rho) \;=\; g\bigl(\alpha_0(T),\, h(\rho)\bigr), \tag{3}$$

where $\alpha_0(T)$ is the base accept rate of an unfiltered single rollout and $h(\rho)$ is the verifier lift from selecting the best of $\rho$ rollouts under a fixed verifier. We adopt the separability assumption

$$\log \alpha(T, \rho) \;=\; \log \alpha_0(T) + h(\rho), \tag{4}$$

and define the elasticities

$$\eta_\alpha^T \;\equiv\; \frac{\partial \log \alpha}{\partial \log T}, \qquad \eta_\alpha^\rho \;\equiv\; \frac{\partial \log \alpha}{\partial \log \rho}. \tag{5}$$

Under (4) the cross-partial $\partial^2 \log \alpha / \partial \log T \, \partial \log \rho$ vanishes. Separability is justified empirically when verifier-guided selection acts on a fixed generator distribution that has already absorbed the post-training lift, as in best-of-N reranking with a frozen process-reward model.
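As a worked step, and assuming $\alpha_0$ and $h$ are differentiable (an assumption not stated in the text), the elasticities in (5) reduce under (4) to single-variable derivatives, which makes the vanishing cross-partial explicit:

```latex
\begin{align*}
\eta_\alpha^T &= \frac{\partial}{\partial \log T}\bigl[\log \alpha_0(T) + h(\rho)\bigr]
              = \frac{d \log \alpha_0(T)}{d \log T}, \\
\eta_\alpha^\rho &= \frac{\partial}{\partial \log \rho}\bigl[\log \alpha_0(T) + h(\rho)\bigr]
                 = \rho\, h'(\rho), \\
\frac{\partial^2 \log \alpha}{\partial \log T \, \partial \log \rho}
              &= \frac{\partial}{\partial \log T}\bigl[\rho\, h'(\rho)\bigr] = 0.
\end{align*}
```

Under separability, $\eta_\alpha^T$ depends on $T$ alone and $\eta_\alpha^\rho$ on $\rho$ alone, so each can be estimated from a sweep over one channel with the other held fixed.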

3.3. Cost ratio and budget constraint

Let $c_T$ denote the marginal cost of one unit of post-training FLOP, amortized over the expected query lifetime $Q$, and $c_I$ the marginal cost of one unit of inference FLOP per query. Define

$$\nu \;\equiv\; \frac{c_T}{c_I}. \tag{6}$$

Under public price points and the price-of-progress dataset of Liao et al. (2025), $\nu$ at the frontier operating point in 2026 is on the order of $10^{-5}$ to $10^{-4}$ per query when amortized over a generator’s commercial lifetime. The operationally relevant quantity is the training-to-inference dollar ratio $\mu \equiv (T \cdot c_T) / (\rho \cdot c_I)$ at the operating point; its reciprocal $1/\mu$ is the inference-to-training dollar ratio that appears on the right-hand side of the threshold in Section 3.4.

3.4. The threshold theorem

The inference channel clears the threshold when the rollout-net-of-cost elasticity ratio exceeds the inference-to-training dollar ratio at the operating point.

We state the result in the rollout-dominant regime where $\rho \gg 1$ so that $(1 + \rho) \approx \rho$. The general statement appears in Appendix A.

Theorem (Threshold). At an interior operating point $(T, \rho)$ with $\rho \gg 1$ under separability (4) and the cost ratio (6), the marginal dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel iff

$$\frac{\eta_\alpha^\rho \;-\; 1}{\eta_\alpha^T} \;>\; \frac{\rho \cdot c_I}{T \cdot c_T} \;=\; \frac{1}{\mu}. \tag{7}$$

Proof. Take logs of (1). The fractional reduction in $C$ from a 1% increase in $T$ is $\eta_\alpha^T$, at a dollar cost of $0.01 \cdot T \cdot c_T$. The fractional reduction in $C$ from a 1% increase in $\rho$ is $\eta_\alpha^\rho - 1$, at a dollar cost of $0.01 \cdot \rho \cdot c_I$. Per-dollar log-reductions:

$$g_T \;=\; \frac{\eta_\alpha^T}{T \cdot c_T}, \qquad g_\rho \;=\; \frac{\eta_\alpha^\rho - 1}{\rho \cdot c_I}. \tag{10}$$

The inference channel dominates iff $g_\rho > g_T$. Cross-multiplying gives (7). $\square$

The theorem partitions the $(T, \rho)$ plane into a training-dominated region and an inference-dominated region. The optimum lies on the boundary, where (7) holds with equality. The right-hand side is observable from the deployment cost ledger. The left-hand side is the rollout-net-of-cost elasticity ratio: it credits rollouts only for the lift in $\alpha$ above the per-rollout cost $\rho/(1+\rho)$, which in the rollout-dominant regime is unity.
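The switching condition can be sketched as a small predicate. This is an illustrative helper under stated assumptions, not a reference implementation: the function and argument names are hypothetical, the default branch uses the general per-rollout cost $\rho/(1+\rho)$ from Appendix A, and the numbers in the example calls are placeholders rather than measured values.

```python
def net_elasticity_ratio(eta_rho: float, eta_T: float, rho: float,
                         rollout_dominant: bool = False) -> float:
    """Left-hand side of the threshold: rollout elasticity net of per-rollout cost."""
    if eta_T <= 0.0:
        raise ValueError("eta_T must be positive at an interior operating point")
    # Per-rollout cost is rho/(1+rho) in general and 1 in the rho >> 1 limit of (7).
    per_rollout_cost = 1.0 if rollout_dominant else rho / (1.0 + rho)
    return (eta_rho - per_rollout_cost) / eta_T


def inference_dominates(eta_rho: float, eta_T: float, rho: float, T: float,
                        c_I: float, c_T: float, rollout_dominant: bool = False) -> bool:
    """True when the marginal dollar is better spent on rollouts than on training."""
    lhs = net_elasticity_ratio(eta_rho, eta_T, rho, rollout_dominant)
    rhs = (rho * c_I) / (T * c_T)  # inference-to-training dollar ratio, 1/mu
    return lhs > rhs


# Placeholder inputs (assumptions): a weak and a strong rollout elasticity.
weak = inference_dominates(eta_rho=0.031, eta_T=0.1, rho=64, T=1.0, c_I=1.0, c_T=1.0)
strong = inference_dominates(eta_rho=1.5, eta_T=0.1, rho=64, T=1000.0, c_I=1.0, c_T=1.0)
print(weak, strong)
```

Whenever $\eta_\alpha^\rho$ falls below the per-rollout cost the left-hand side is negative, so the predicate is false for any positive dollar ratio.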

3.5. Comparative statics

Three corollaries follow directly from (7).

Corollary 1 (frontier ceiling). As $\alpha_0 \to 1$ at fixed verifier, $\eta_\alpha^T \to 0$. The right-hand side of (7) is bounded; the left-hand side grows without bound. Frontier-difficulty subsets satisfy the threshold; easy subsets do not.

Corollary 2 (reasoning multiplier). Tasks with high $R$ magnify the absolute dollar return to either channel. Combined with the α-asymmetry result, reasoning-heavy workloads favor inference-time allocation; retrieval-heavy workloads do not.

Corollary 3 (amortization). When $Q$ is large, $c_T$ and hence $\nu$ fall, and at a fixed operating point $\mu = (T/\rho)\,\nu$ falls with them, so the right-hand side $1/\mu$ of (7) rises and the inference channel must clear a higher bar to dominate. This predicts that high-throughput commodity tiers serving long-lived workloads do not deploy thinking budgets: the cost-per-correct-answer reduction from rollouts on easy tasks is already small, and amortization raises the bar it must clear.

4. Experiments

This section calibrates the threshold (7) against four operating points. All numbers are cited from primary sources; we report no new measurements.

4.1. rStar-Math (Microsoft Research, January 2025)

Guan et al. (2025) report a Qwen2.5-Math-7B generator paired with a 7B process-reward verifier and MCTS rollouts. The deployed configuration runs $\rho = 64$ rollouts per query, reporting pass@1 of $0.533$ on AIME 2024 and $0.900$ on MATH-500.

The secant elasticity over the in-MCTS sweep $\rho = 8 \to 64$ on AIME 2024 is $\log(0.533/0.500)/\log(64/8) \approx 0.031$; on MATH-500 it is $\log(0.900/0.894)/\log(64/8) \approx 0.003$.
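The secant calculation above can be reproduced in a few lines. The helper name is hypothetical; the inputs are the pass@1 values and rollout counts quoted in this section.

```python
import math


def secant_elasticity(alpha_lo: float, alpha_hi: float, rho_lo: float, rho_hi: float) -> float:
    """Secant estimate of d log(alpha) / d log(rho) between two rollout counts."""
    return math.log(alpha_hi / alpha_lo) / math.log(rho_hi / rho_lo)


# pass@1 at rho = 8 and rho = 64, as quoted from Guan et al. (2025).
aime_2024 = secant_elasticity(0.500, 0.533, 8, 64)
math_500 = secant_elasticity(0.894, 0.900, 8, 64)

print(f"AIME 2024: {aime_2024:.3f}, MATH-500: {math_500:.3f}")
```

Both estimates are far below 1, which is what drives the negative numerator $(\eta_\alpha^\rho - 1)$ in the substitution that follows.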

Substituting into (7): $(\eta_\alpha^\rho - 1)/\eta_\alpha^T = (0.031 - 1)/\eta_\alpha^T \approx -0.97/\eta_\alpha^T < 0$ for any positive $\eta_\alpha^T$. The inference channel does not clear the threshold at the deployed $\rho = 64$. rStar-Math optimized headline accuracy at fixed model scale, not cost-per-correct-answer; the deployed configuration sits inside the verifier-ceiling regime (Corollary 1). A cost-conscious redeployment would run at materially lower $\rho$, trading accuracy for cost-per-correct-answer reduction.

4.2. DeepSeek-R1 (DeepSeek-AI, January 2025)

DeepSeek-AI (2025) lift pass@1 on AIME 2024 from $0.392$ (DeepSeek-V3 base) to $0.798$ (DeepSeek-R1) through RL with verifiable-reward signals at fixed rollout count ($\rho = 1$).

DeepSeek does not disclose RL post-training compute as a fraction of V3 pre-training. Under a sensitivity bracket $s = \Delta T / T_{V3} \in [0.01, 0.10]$, the implied training-channel elasticity on AIME 2024 is $\log(0.798/0.392)/\log(1+s) \in [7.5, 71]$.
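The bracket can be checked directly. In this sketch the helper name is hypothetical, and the two values of $s$ are the endpoints of the sensitivity bracket assumed above, not disclosed compute figures.

```python
import math


def training_elasticity(alpha_base: float, alpha_post: float, s: float) -> float:
    """Implied d log(alpha) / d log(T) when training compute grows by a fraction s."""
    return math.log(alpha_post / alpha_base) / math.log1p(s)


# AIME 2024 pass@1: DeepSeek-V3 base -> DeepSeek-R1, at the bracket endpoints.
eta_T_low = training_elasticity(0.392, 0.798, 0.10)   # largest assumed RL share
eta_T_high = training_elasticity(0.392, 0.798, 0.01)  # smallest assumed RL share

print(f"eta_T in [{eta_T_low:.1f}, {eta_T_high:.1f}]")
```

The smaller the assumed RL share $s$, the larger the implied elasticity, so the upper end of the bracket corresponds to $s = 0.01$.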

For the inference channel to clear (7) at R1 would require $(\eta_\alpha^\rho - 1)/\eta_\alpha^T > 1/\mu$. If the right-hand side is of order unity, as assumed in Section 4.4, this means $\eta_\alpha^\rho \gtrsim 8$ to $71$, implausible for any published verifier on AIME 2024. The corner solution $\rho = 1$ is therefore consistent with (7) across the full sensitivity bracket.

4.3. Test-time-compute curves (Snell et al. 2024; Brown et al. 2024)

The hard-subset regime in Snell et al. (2024) corresponds to $\alpha_0$ far from 1 and $\eta_\alpha^\rho$ in the 0.5–1.0 range. The 14× substitution result implies $\eta_\alpha^\rho \cdot \mu \gg \eta_\alpha^T$, exactly the threshold (7) in its $\eta_\alpha^\rho \gg 1$ form. The easy-subset regime corresponds to $\alpha_0 \to 1$ and $\eta_\alpha^\rho \to 0$, where the threshold flips.

Brown et al. (2024) report the same pattern in pass@k form on Llama and Pythia. On hard benchmarks (MiniF2F, MATH-hard subsets), the exponent (the local $\eta_\alpha^\rho$ in our notation) is large and the substitution holds; on easy benchmarks the exponent is small and the substitution breaks. The crossover occurs precisely where $(\eta_\alpha^\rho - 1)/\eta_\alpha^T = 1/\mu$, which is (7) with equality.

4.4. Negative case: commodity tiers

At $\alpha_0 > 0.95$ on routine-task workloads (short-form generation, retrieval, classification), $\eta_\alpha^\rho$ is bounded above by $1 - \alpha_0 < 0.05$ . The right-hand side of (7) is order unity. The threshold fails by an order of magnitude.

The prediction is that commodity tiers should not deploy explicit thinking budgets. They do not. The same prediction explains the absence of a continuous gradient of small-thinking-budget tiers between commodity and frontier.

**Table 1.** Threshold (7) calibration across four operating points. The threshold is crossed in the hard-reasoning regime and missed in all other cases, matching observed deployment choices.

| Operating point | $\eta_\alpha^\rho$ | Threshold crossed? | Deployment fact |
| --- | --- | --- | --- |
| rStar-Math, $\rho = 64$ | 0.031 (secant, AIME 2024) | No: $(\eta_\alpha^\rho - 1) < 0$ | Fixed $T$, accuracy-optimized |
| DeepSeek-R1, $\rho = 1$ | N/A ($\rho = 1$) | Consistent: corner $\rho = 1$ | Training-channel RL allocation |
| Snell et al. hard subsets | 0.5–1.0 | Yes: 14× substitution | Test-time compute dominant |
| Commodity tiers | $< 0.05$ | No: $\alpha_0 > 0.95$ | No rollout budget deployed |

5. Discussion

5.1. Capital allocation across the two channels

The threshold (7) gives a quantitative rule for where the next compute dollar should go. Frontier providers facing hard-reasoning workloads should mix, allocating to both channels along the boundary defined by equality in (7). Commodity providers facing easy-task workloads should allocate to the training channel only.

The observed market structure matches both predictions. The 2026 reasoning tier ships with thinking budgets that are themselves a tunable parameter: evidence that the provider sits on the boundary and lets the customer pick the operating point. Commodity tiers ship without rollout budgets at all: evidence that the provider sits well inside the training-dominated region.

5.2. The GPT-5.5 reprice as a falsifiable hypothesis

OpenAI raised GPT-5.5 prices by 100% over GPT-5.4 in April 2026. Under (1), a price increase at fixed $\mathrm{CPM}_{1:1}$ requires a fall in $\alpha$, a rise in $R$, or a rise in $\bar\rho$. The threshold theorem rationalizes the move only if the GPT-5.5 workload mix has shifted toward harder reasoning tasks where the inference-channel allocation share has risen. This is consistent with OpenAI’s published statement on GPT-5.5 thinking budgets.

The hypothesis is falsifiable: if a future GPT-5.5 disclosure shows flat or falling rollout share on a workload mix shifted toward easy tasks, a competing explanation is required.

5.3. Limitations

Separability assumption. If $\eta_\alpha^\rho$ depends materially on $T$ (the verifier and generator have not absorbed each other’s progress), the cross-partial does not vanish and (7) holds only locally. A broader calibration tracing non-separability is future work.

Fixed verifier construction cost. We have treated verifier construction cost as amortized over the verifier’s lifetime. If the verifier does not transfer across tasks, the fixed-cost approximation breaks and the threshold shifts toward training-channel allocation.

Small calibration set. The calibration rests on the four operating points of Table 1. A population-level calibration with the full 2024–2026 reasoning-model release sequence would tighten the elasticity estimates.

6. Conclusion

Inference-time scaling and training compute are substitutes on hard reasoning tasks, but allocation is a different question from substitutability. We have derived a closed-form threshold under the Cost-correct decomposition that says when the next dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel, calibrated the threshold against four operating points, and shown that the calibration matches the observed market split between frontier reasoning and commodity tiers.

The next paper in the sequence relaxes the separability assumption by treating verifier portability as the primary object of study.

Appendix A. Full proof of the threshold theorem

Theorem (Threshold, general). At an interior operating point $(T, \rho)$ under separability (4) and cost ratio (6), the marginal dollar reduces $C$ faster on the inference channel iff

$$\frac{\eta_\alpha^\rho \;-\; \tfrac{\rho}{1+\rho}}{\eta_\alpha^T} \;>\; \frac{\rho \cdot c_I}{T \cdot c_T}. \tag{A.1}$$

Proof. Taking logs of (1) and differentiating,

$$\frac{\partial \log C}{\partial T} = -\frac{\eta_\alpha^T}{T}, \qquad \frac{\partial \log C}{\partial \rho} = \frac{1}{1+\rho} - \frac{\eta_\alpha^\rho}{\rho}. \tag{A.2}$$

The fractional reduction in $C$ per dollar on the training channel is $\eta_\alpha^T / (T \cdot c_T)$. The fractional reduction in $C$ per dollar on the inference channel is $(\eta_\alpha^\rho/\rho - 1/(1+\rho)) / c_I$. Setting the inference rate strictly greater than the training rate, multiplying through by $\rho \cdot c_I$, and dividing by $\eta_\alpha^T > 0$ gives (A.1). The $\rho \gg 1$ limit gives $\rho/(1+\rho) \to 1$, recovering (7). $\square$
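The intermediate algebra can be written out as follows. This is a sketch that identifies $\bar\rho$ in (1) with the rollout count $\rho$, holds $\mathrm{CPM}_{1:1}$ and $R$ fixed, and takes $\eta_\alpha^T > 0$; these are assumptions made here for illustration.

```latex
\begin{align*}
\log C &= \log \mathrm{CPM}_{1:1} + \log R + \log(1+\rho) - \log \alpha(T,\rho), \\[4pt]
-\frac{1}{c_T}\,\frac{\partial \log C}{\partial T}
  &= \frac{\eta_\alpha^T}{T \cdot c_T}, \qquad
-\frac{1}{c_I}\,\frac{\partial \log C}{\partial \rho}
   = \frac{1}{c_I}\Bigl(\frac{\eta_\alpha^\rho}{\rho} - \frac{1}{1+\rho}\Bigr), \\[4pt]
\frac{1}{c_I}\Bigl(\frac{\eta_\alpha^\rho}{\rho} - \frac{1}{1+\rho}\Bigr) > \frac{\eta_\alpha^T}{T \cdot c_T}
  &\iff \eta_\alpha^\rho - \frac{\rho}{1+\rho} > \eta_\alpha^T \cdot \frac{\rho \cdot c_I}{T \cdot c_T} \\
  &\iff \frac{\eta_\alpha^\rho - \tfrac{\rho}{1+\rho}}{\eta_\alpha^T} > \frac{\rho \cdot c_I}{T \cdot c_T}.
\end{align*}
```

The second equivalence is the only step that uses the sign of $\eta_\alpha^T$; if the training elasticity were negative the inequality would reverse.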

Corollary (boundary curvature). The boundary surface where (A.1) holds with equality is concave in the rollout-dominant regime; the iso-cost-correct curves in the same plane are convex; the optimum lies at the unique tangent point.

Appendix B. Calibration tables

Table B.1. rStar-Math operating points. Source: Guan et al. (2025), Table 5.

| Model | Benchmark | $\rho$ | pass@1 | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-Math-7B base | AIME 2024 | 1 | 0.000 | base generator, no MCTS |
| Qwen2.5-Math-7B base | MATH-500 | 1 | 0.588 | base generator, no MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 8 | 0.500 | in-MCTS |
| rStar-Math (7B + 7B PRM) | MATH-500 | 8 | 0.894 | in-MCTS |
| rStar-Math (7B + 7B PRM) | AIME 2024 | 64 | 0.533 | deployed |
| rStar-Math (7B + 7B PRM) | MATH-500 | 64 | 0.900 | deployed |

Table B.2. DeepSeek-R1 vs DeepSeek-V3 base at $\rho = 1$. Source: DeepSeek-AI (2025), Table 4.

| Model | Benchmark | $\rho$ | pass@1 | Notes |
| --- | --- | --- | --- | --- |
| DeepSeek-V3 base | AIME 2024 | 1 | 0.392 | Pre-RL baseline |
| DeepSeek-R1-Zero | AIME 2024 | 1 | 0.710 | Pure RL, no SFT |
| DeepSeek-R1 | AIME 2024 | 1 | 0.798 | Post verifiable-reward RL |
| DeepSeek-V3 base | MATH-500 | 1 | 0.902 | Pre-RL baseline |
| DeepSeek-R1 | MATH-500 | 1 | 0.973 | Post verifiable-reward RL |

Table B.3. Snell et al. (2024) headline substitution result on PaLM-2-S MATH subsets.

| Subset | Substitution ratio (test-time / pre-training) | Threshold prediction |
| --- | --- | --- |
| Hard MATH | 14× | Crosses threshold |
| Easy MATH | <1× | Does not cross |

Table B.4. Commodity-tier deployments (negative case). Source: Field Notes #1.

| Model | Workload | $\bar\rho$ deployed | $\alpha$ on workload |
| --- | --- | --- | --- |
| GPT-5.4 nano | Retrieval / short-form | 1 | >0.95 |
| Gemini Flash | Retrieval / short-form | 1 | >0.95 |
| Claude Haiku 4.5 | Retrieval / short-form | 1 | >0.95 |

References

Cite this article

@misc{bhardwaj2026inferencetimefrontier,
  author       = {Bhardwaj, Manu},
  title        = {The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation},
  year         = {2026},
  month        = {May},
  url          = {https://ifitsmanu.com/papers/inference-frontier},
  howpublished = {\url{https://ifitsmanu.com/papers/inference-frontier/paper.pdf}},
  note         = {Working paper. Version 1.0.}
}

Bhardwaj, M. (2026, May). The inference-time compute frontier: A cost-correct threshold for training versus test-time allocation. ifitsmanu.com. https://ifitsmanu.com/papers/inference-frontier

Bhardwaj, Manu. "The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation." ifitsmanu.com, May 2026. https://ifitsmanu.com/papers/inference-frontier.

M. Bhardwaj, "The Inference-Time Compute Frontier: A Cost-Correct Threshold for Training Versus Test-Time Allocation," ifitsmanu.com, May 2026. [Online]. Available: https://ifitsmanu.com/papers/inference-frontier
