Search

Search Archive

Papers, field notes, programs, topics, reference surfaces, raw source, and citation-ready exports by Manu Bhardwaj.

50 items

Paper Disaggregated or Colocated? The Cost-Frontier of LLM Serving Under SLO Contracts. LLM serving in 2026 is not a single architecture. Colocated continuous batching, chunked-prefill colocation, and prefill/decode disaggregation each report goodput wins on different workload mixes against different baselines. Production teams pick architectures without a frontier to point at. We develop a closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax that applies only in disaggregated mode. We re-derive published throughput numbers from five 2023–2025 systems papers into a common frame. We plot the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts. We solve for the break-even surface between colocated and disaggregated architectures as a function of input/output ratio, arrival rate, KV-transfer bandwidth, and SLO slack. The frontier partitions. Disaggregation dominates the prefill-heavy long-context region. Chunked-prefill colocation dominates the decode-heavy short-context region. The crossover is sensitive to KV-transfer bandwidth and shifts visibly between A100, H100, and H200 deployments.

Paper Calibration Drift Under Verifier Composition. A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization. Production large language model verification is composed. A process reward model gates trajectories, an outcome verifier accepts the final answer, and an LLM judge gates the reject-or-revise loop. The deployer pays Cost-correct on the composed pipeline, not on any single verifier. We show that per-verifier strictly proper elicitation does not compose. Pipeline miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. A joint scoring-rule mechanism over the cross-product report space restores dominant-strategy incentive compatibility, ex post individual rationality, and budget feasibility.
The deployer's expected gap to first-best Cost-correct is at most C_H · sqrt((log K_1 + log K_2) / N) over K_1 · K_2 candidate pairs, by Hoeffding plus a union bound; a matching lower bound holds on a calibration-monotone-pair family by Le Cam's two-point method. Simulation on MATH, GSM8K, and HumanEval reaches the 5%-of-first-best target at N = 512 under unknown joint correlation (roughly double Paper #1's N = 256), and at N = 256 when correlation is supplied as a side channel. The per-verifier baseline does not reach the target at any N tested when |C| ≥ 0.1. The compliance corollary is sharp. Per-component procurement records are insufficient evidence under the EU AI Act high-risk obligations entering into force on August 2, 2026. The audit trail must include the joint-report ledger.

Paper The Inference-Time Compute Frontier. A Cost-Correct Threshold for Training Versus Test-Time Allocation. When does an additional dollar of compute reduce cost-per-correct-answer faster when spent on inference-time scaling than when spent on further training? Under the Cost-correct decomposition, with verifier accept rate parameterized jointly in training compute T and rollout count ρ, the marginal dollar reduces cost-per-correct-answer faster on the inference channel iff (η_α^ρ − 1)/η_α^T exceeds the inference-to-training dollar ratio at the operating point. Calibrated against rStar-Math (threshold not crossed at ρ=64, Corollary 1 applies), DeepSeek-R1 (corner ρ=1 consistent with high η_α^T bracket), Snell et al. hard subsets (threshold crossed, 14× substitution), and commodity tiers (threshold fails at α₀>0.95). The calibration matches the observed market split between frontier reasoning tiers and commodity tiers.

Paper The Routing Premium. An Economic Threshold for Difficulty-Conditional Inference Compute. When does conditioning inference compute on a noisy estimate of task difficulty reduce cost-per-correct-answer relative to a fixed-compute baseline?
Five published patterns route compute on a difficulty signal. Two operate at the per-token or per-layer level: speculative decoding (Leviathan et al. 2023; Cai et al. 2024) and early-exit decoding (Schuster et al. 2022). Three operate at the per-query level: cascade routing (Chen et al. 2023), adaptive self-consistency (Petullo et al. 2026a), and complexity-aware exploration (Petullo and Xue 2026). None derives the threshold above which the routing rule pays. We derive one. Under the Cost-correct decomposition, the routing premium is positive iff κ·Δ > γ at the margin around the unconditional optimum, where κ is classifier calibration, Δ is workload heterogeneity in compute, and γ is classifier overhead. The condition unifies the five patterns as one allocation rule. We calibrate against six published systems spanning all five classes and find that every operating point sits on the positive side of the threshold. The elasticity reading isolates which operating points are close enough to fail under modest disclosure error.

Paper Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization. A deployer of a large language model who does not train its own verifier must buy verification from a third party. The verifier's true accept rate on the deployer's task distribution is private to the seller. Public benchmark scores do not reveal it. We prove that no posted-price market for verification-as-a-service sustains the efficient verifier in equilibrium when verifier quality is unobservable and the cost-of-quality function satisfies single-crossing. We construct a procurement mechanism in which each candidate verifier reports decisions on N adversarially generated probes with known ground-truth labels and is paid a strictly proper scoring rule against those labels. The mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under a per-probe payment cap.
Expected Cost-correct gap to oracle is at most a constant times sqrt(log K / N), with a matching lower bound on a calibration-monotone family. A simulation on MATH, GSM8K, and HumanEval confirms a 5% gap at N=256 under maximin-entropy probes, while posted-price baselines fail to close even 30% of the gap at any N tested.

Field Note The Heterogeneous-GPU Margin. Coral and the Multi-LLM Procurement Problem. A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi (arXiv:2605.04357). Coral is an adaptive heterogeneity-aware multi-LLM serving system that jointly optimizes resource allocation and serving strategy across all model replicas in a fleet. The result identifies a fifth inference-economics lever structurally upstream of quantization, runtime, decoding-time parallelism, and per-replica hardware choice: fleet-level procurement of which model lands on which GPU class. The 2.79x cost reduction and 2.39x goodput-under-scarcity lifts come from the gap between a mixed-generation joint schedule and a homogeneous one.

Field Note Harvesting Serving Slack. ROSE and the Collapsed Train-Serve Boundary. A daily field note on Gao, Zhao, Muhtar et al. (arXiv:2605.06534). ROSE is a cooperative, resource-elastic post-training system that runs agentic RL rollouts on idle serving GPUs. The economically interesting move is that the rollout term in the Cost-correct decomposition can now be priced at the marginal-of-idle rate rather than the dedicated-training-cluster rate, which forces a rewrite of the inference-frontier threshold across a previously clean train-serve boundary.

Field Note The Power-Cap Illusion. SM Clock Locking and the Real Decode Lever. A daily field note on Ma, Afzal, Eitzinger, and Wellein (arXiv:2605.11999). Across GQA, Multi-head Latent Attention, Gated DeltaNet, and Mamba2 on NVIDIA H200, autoregressive decode draws only 137 to 300 W on a 700 W GPU and no power cap ever triggers.
The cap is above the natural ceiling of a memory-bound workload that saturates HBM bandwidth rather than compute. SM clock locking is the lever actually on the critical path and Pareto-dominates power capping, recovering up to 32% of decode energy at minimal throughput loss. The paper identifies three architecture-dependent DVFS behavioral classes and reports a prefill-decode energy crossover that halves total request energy relative to GQA at production batch sizes. The economic consequence is a tightened decode-cost term in Cost-correct and a shift in the inference-frontier threshold in favor of memory-efficient attention replacements.

Field Note The Verifier as Curriculum. VHG and the Third Role. A daily field note on Lai, Feng, Teh, and Miao (arXiv:2605.06660). VHG is a three-party setter-solver-verifier self-play framework that prevents reward hacking in synthetic problem generation by routing the setter's reward through an independent verifier before the solver's difficulty signal is applied. The result names a third production role for the verifier in the inference-economics framework: not just inference-time gate or training-time reward function, but training-data curator one level upstream of both.

Field Note The Structural Residual Ceiling. AI Pre-Decoders for the Surface Code. NVIDIA's Ising-Decoding pre-decoder pipeline reaches a logical-error-rate ceiling at distance 17 and above when paired with correlated PyMatching. The ceiling is structural: it follows from the deterministic homological-equivalence canonicalization used to generate training labels, not from network capacity. Three falsifiable mitigations are outlined, all testable inside the released codebase.

Field Note The Alpha Asymmetry. Why Verifiers Can Be Smaller Than Generators. Cost-correct is hyperbolic in verifier accept rate and linear in the other major inference-cost terms.
In the operating regime where production reasoning workloads sit, verifier construction can move total cost more than comparable work on token price, reasoning length, or rollout policy.

Field Note The Cost of Being Right. Verification Economics in 2026. Reasoning models, RL with verifiable rewards, and verifier-selected outputs shift the unit of account in LLM inference from cost-per-token to cost-per-correct-answer. The binding lever in this regime is verification.

Field Note The Inference Stack in 2026. Public LLM API prices compressed sharply between 2023 and 2026, but the compression is uneven across model classes. The important levers are quantization, runtime systems, decoding-time parallelism, and hardware competition.

Definition AWQ Quantization A post-training quantization method that protects high-signal weights using activation statistics.
Definition Speculative Decoding A decoding-time acceleration technique that drafts tokens with a smaller model and verifies them with a larger model.
Definition Mamba And State-Space Models A family of sequence models that replace quadratic attention with selective state-space mechanisms.
Definition GPS-Denied Navigation Navigation under unreliable or unavailable satellite positioning, using onboard sensing and inference.
Definition Edge AI Silicon The constrained compute layer for running inference close to sensors, robots, drones, and devices.
Definition Verification Economics The study of cost-per-correct-answer and the verifier as a cost-and-value lever in reasoning systems.

active Agent Infrastructure Runtime, memory, verification, tooling, and reliability layers for long-running agents.
active Voice Systems Real-time voice systems under latency, turn-taking, reliability, and interface constraints.
active Inference Economics Cost, quality, latency, verification, and hardware structure for running AI systems.
emerging Human-Agent Interfaces Interfaces for operating AI-native systems without losing control, context, or trust.
emerging Financial Infrastructure Market structure, research tooling, execution infrastructure, and AI-assisted financial systems.
emerging Embedded Autonomy Autonomous behavior under power, compute, sensing, and deployment constraints.
emerging Distributed Runtimes Runtime systems, state, scheduling, observability, and reliability for AI workloads.

topic AI Systems Engineering Engineering AI systems across model behavior, runtime, evaluation, infrastructure, interfaces, and cost.
topic Agent Infrastructure Runtime, memory, tooling, verification, and operating layers for long-running agent systems.
topic Voice AI Systems Real-time speech and agent systems where latency, turn-taking, reliability, and interface behavior are binding constraints.
topic Inference Economics The cost, latency, quality, and verification structure of running AI systems after training.
topic Verification Economics A cost model centered on correct answers, verifier accept rates, and the economics of deciding whether outputs are usable.
topic Financial Infrastructure Systems for research workflows, market structure, execution, risk, and financial automation.
topic Market Structure The mechanisms, incentives, venues, and infrastructure that shape how markets route, price, and settle activity.
topic Distributed Systems The coordination, reliability, state, and runtime behavior of systems spread across machines or services.
topic Embedded Autonomy Autonomous behavior under compute, power, sensing, latency, and deployment constraints.
topic Drones And Robotics Systems that combine perception, control, navigation, autonomy, embedded compute, and operational constraints.
topic Human-Agent Interfaces The interaction layer between humans and AI systems, especially where trust, handoff, memory, and control matter.
topic Operator Systems Personal, organizational, and technical systems for decision-making, automation, instrumentation, and execution.

surface RSS publication feed
surface Atom alternate feed
surface BibTeX citation export
surface Raw Markdown AI-readable source exports
surface PDFs archival paper versions
surface llms.txt curated LLM map
surface llms-full.txt full text crawl surface
surface Sitemap canonical URL map
surface API JSON structured archive index

identity About Manu Bhardwaj Canonical profile, identity links, areas of attention, and selected public work.
contact Correspondence Selected technical notes, corrections, research questions, and direct correspondence.
selected work Work with me Current lanes for advisory, podcast, and collaboration work.