Raw Markdown

Source Exports

Markdown versions of the public papers. These source surfaces are kept readable for humans, search systems, and retrieval tools.

Markdown / 2026-05-16

Disaggregated or Colocated? The Cost-Frontier of LLM Serving Under SLO Contracts.

LLM serving in 2026 is not a single architecture. Colocated continuous batching, chunked-prefill colocation, and prefill/decode disaggregation each report goodput wins on different workload mixes against different baselines. Production teams pick architectures without a frontier to point at. We develop a closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax that applies only in disaggregated mode. We re-derive published throughput numbers from five 2023–2025 systems papers into a common frame. We plot the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts. We solve for the break-even surface between colocated and disaggregated architectures as a function of input/output ratio, arrival rate, KV-transfer bandwidth, and SLO slack. The frontier partitions. Disaggregation dominates the prefill-heavy long-context region. Chunked-prefill colocation dominates the decode-heavy short-context region. The crossover is sensitive to KV-transfer bandwidth and shifts visibly between A100, H100, and H200 deployments.

Markdown / 2026-05-11

Calibration Drift Under Verifier Composition. A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.

Production large language model verification is composed. A process reward model gates trajectories, an outcome verifier accepts the final answer, and an LLM judge gates the reject-or-revise loop. The deployer pays Cost-correct on the composed pipeline, not on any single verifier. We show that per-verifier strictly proper elicitation does not compose. Pipeline miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. A joint scoring-rule mechanism over the cross-product report space restores dominant-strategy incentive compatibility, ex post individual rationality, and budget feasibility. The deployer's expected gap to first-best Cost-correct is at most C_H · sqrt((log K_1 + log K_2) / N) over K_1 · K_2 candidate pairs, by Hoeffding plus a union bound; a matching lower bound holds on a calibration-monotone-pair family by Le Cam's two-point method. Simulation on MATH, GSM8K, and HumanEval reaches the 5%-of-first-best target at N = 512 under unknown joint correlation (roughly double Paper #1's N = 256), and at N = 256 when correlation is supplied as a side channel. The per-verifier baseline does not reach the target at any N tested when |C| ≥ 0.1. The compliance corollary is sharp. Per-component procurement records are insufficient evidence under the EU AI Act high-risk obligations entering force on August 2, 2026. The audit trail must include the joint-report ledger.

Markdown / 2026-05-15

The Inference-Time Compute Frontier. A Cost-Correct Threshold for Training Versus Test-Time Allocation.

When does an additional dollar of compute reduce cost-per-correct-answer faster when spent on inference-time scaling than when spent on further training? Under the Cost-correct decomposition, with verifier accept rate parameterized jointly in training compute T and rollout count ρ, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff (η_α^ρ − 1)/η_α^T exceeds the inference-to-training dollar ratio at the operating point. Calibrated against rStar-Math (threshold not crossed at ρ=64, Corollary 1 applies), DeepSeek-R1 (corner ρ=1 consistent with high η_α^T bracket), Snell et al. hard subsets (threshold crossed, 14× substitution), and commodity tiers (threshold fails at α₀>0.95). The calibration matches the observed market split between frontier reasoning tiers and commodity tiers.

Markdown / 2026-05-15

The Routing Premium. An Economic Threshold for Difficulty-Conditional Inference Compute.

When does conditioning inference compute on a noisy estimate of task difficulty reduce cost-per-correct-answer relative to a fixed-compute baseline? Five published patterns route compute on a difficulty signal. Two operate at the per-token or per-layer level: speculative decoding (Leviathan et al. 2023; Cai et al. 2024) and early-exit decoding (Schuster et al. 2022). Three operate at the per-query level: cascade routing (Chen et al. 2023), adaptive self-consistency (Petullo et al. 2026a), and complexity-aware exploration (Petullo and Xue 2026). None derives the threshold above which the routing rule pays. We derive one. Under the Cost-correct decomposition, the routing premium is positive iff κ·Δ > γ at the margin around the unconditional optimum, where κ is classifier calibration, Δ is workload heterogeneity in compute, and γ is classifier overhead. The condition unifies the five patterns as one allocation rule. We calibrate against six published systems spanning all five classes and find that every operating point sits on the positive side of the threshold. The elasticity reading isolates which operating points are close enough to fail under modest disclosure error.

Markdown / 2026-05-10

Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization.

A deployer of a large language model who does not train its own verifier must buy verification from a third party. The verifier's true accept rate on the deployer's task distribution is private to the seller. Public benchmark scores do not reveal it. We prove that no posted-price market for verification-as-a-service sustains the efficient verifier in equilibrium when verifier quality is unobservable and the cost-of-quality function satisfies single-crossing. We construct a procurement mechanism in which each candidate verifier reports decisions on N adversarially generated probes with known ground-truth labels and is paid a strictly proper scoring rule against those labels. The mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under a per-probe payment cap. Expected Cost-correct gap to oracle is at most a constant times sqrt(log K / N), with a matching lower bound on a calibration-monotone family. A simulation on MATH, GSM8K, and HumanEval confirms a 5% gap at N=256 under maximin-entropy probes, while posted-price baselines fail to close even 30% of the gap at any N tested.

Markdown / 2026-05-11

The Heterogeneous-GPU Margin. Coral and the Multi-LLM Procurement Problem.

A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi (arXiv:2605.04357). Coral is an adaptive heterogeneity-aware multi-LLM serving system that jointly optimizes resource allocation and serving strategy across all model replicas in a fleet. The result identifies a fifth inference-economics lever structurally upstream of quantization, runtime, decoding-time parallelism, and per-replica hardware choice: fleet-level procurement of which model lands on which GPU class. The 2.79x cost reduction and 2.39x goodput-under-scarcity lifts come from the gap between a mixed-generation joint schedule and a homogeneous one.

Markdown / 2026-05-16

Harvesting Serving Slack. ROSE and the Collapsed Train-Serve Boundary.

A daily field note on Gao, Zhao, Muhtar et al. (arXiv:2605.06534). ROSE is a cooperative, resource-elastic post-training system that runs agentic RL rollouts on idle serving GPUs. The economically interesting move is that the rollout term in the Cost-correct decomposition can now be priced at the marginal-of-idle rate rather than the dedicated-training-cluster rate, which forces a rewrite of the inference-frontier threshold across a previously clean train-serve boundary.

Markdown / 2026-05-17

The Power-Cap Illusion. SM Clock Locking and the Real Decode Lever.

A daily field note on Ma, Afzal, Eitzinger, and Wellein (arXiv:2605.11999). Across GQA, Multi-head Latent Attention, Gated DeltaNet, and Mamba2 on NVIDIA H200, autoregressive decode draws only 137 to 300 W on a 700 W GPU and no power cap ever triggers. The cap is above the natural ceiling of a memory-bound workload that saturates HBM bandwidth rather than compute. SM clock locking is the lever actually on the critical path and Pareto-dominates power capping, recovering up to 32% of decode energy at minimal throughput loss. The paper identifies three architecture-dependent DVFS behavioral classes and reports a prefill-decode energy crossover that halves total request energy relative to GQA at production batch sizes. The economic consequence is a tightened decode-cost term in Cost-correct and a shift in the inference-frontier threshold in favor of memory-efficient attention replacements.

Markdown / 2026-05-10

The Verifier as Curriculum. VHG and the Third Role.

A daily field note on Lai, Feng, Teh, and Miao (arXiv:2605.06660). VHG is a three-party setter-solver-verifier self-play framework that prevents reward hacking in synthetic problem generation by routing the setter's reward through an independent verifier before the solver's difficulty signal is applied. The result names a third production role for the verifier in the inference-economics framework: not just inference-time gate or training-time reward function, but training-data curator one level upstream of both.

Markdown / 2026-05-07

The Structural Residual Ceiling. AI Pre-Decoders for the Surface Code.

NVIDIA's Ising-Decoding pre-decoder pipeline reaches a logical-error-rate ceiling at distance 17 and above when paired with correlated PyMatching. The ceiling is structural: it follows from the deterministic homological-equivalence canonicalization used to generate training labels, not from network capacity. Three falsifiable mitigations are outlined, all testable inside the released codebase.

Markdown / 2026-05-03

The Inference Stack in 2026.

Public LLM API prices compressed sharply between 2023 and 2026, but the compression is uneven across model classes. The important levers are quantization, runtime systems, decoding-time parallelism, and hardware competition.