Paper / May 16, 2026
Research Paper #1 in the AI systems engineering wedge. A closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax. Re-derives published throughput from five 2023–2025 systems papers (PagedAttention, Sarathi-Serve, DistServe, Splitwise, Mooncake) into a common frame, plots the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts, and solves the break-even surface between colocated and disaggregated architectures. The frontier partitions cleanly.
Paper / May 15, 2026
Research Paper #2 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when the marginal compute dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel. Calibrated against rStar-Math, DeepSeek-R1, and test-time-compute curves; matches the observed frontier-vs-commodity market split.
Paper / May 15, 2026
Research Paper #3 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when conditioning inference compute on a noisy difficulty estimate reduces cost-per-correct-answer: routing pays iff κ·Δ > γ, where κ is classifier calibration, Δ is workload heterogeneity, and γ is classifier overhead. Unifies five published patterns (speculative decoding, cascades, adaptive self-consistency, complexity-aware exploration, early exit) as one allocation rule, and calibrates against six deployed systems with every operating point on the positive side of the threshold.
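The routing rule stated above (routing pays iff κ·Δ > γ) can be sketched as a one-line check. This is a minimal illustration rather than the paper's implementation: the names `RoutingInputs`, `routing_margin`, and `routing_pays` are hypothetical, and the sketch assumes κ is a dimensionless calibration score and that Δ and γ are expressed in the same cost units so the two sides are comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingInputs:
    """Inputs to the routing threshold (names and units are assumptions)."""
    kappa: float  # classifier calibration, assumed dimensionless
    delta: float  # workload heterogeneity, assumed in the same units as gamma
    gamma: float  # classifier overhead


def routing_margin(x: RoutingInputs) -> float:
    """Signed distance from the threshold: kappa * delta - gamma."""
    return x.kappa * x.delta - x.gamma


def routing_pays(x: RoutingInputs) -> bool:
    """True iff kappa * delta > gamma, i.e. the margin is strictly positive."""
    return routing_margin(x) > 0.0


if __name__ == "__main__":
    # Hypothetical operating points, not taken from any deployed system.
    well_calibrated = RoutingInputs(kappa=0.8, delta=0.5, gamma=0.1)
    poorly_calibrated = RoutingInputs(kappa=0.1, delta=0.5, gamma=0.1)
    print(routing_pays(well_calibrated), routing_pays(poorly_calibrated))
```

Under these assumptions, a well-calibrated classifier on a heterogeneous workload clears the threshold, while the same overhead with weak calibration does not. The margin function is kept separate so that operating points can be ranked by how far they sit from the break-even line.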
Field Note / May 11, 2026
A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi's Coral: cost-efficient multi-LLM serving over heterogeneous cloud GPUs. Why the fragmentation of the LLM market and the heterogeneity of GPU supply make joint allocation the binding cost lever.
Field Note / May 16, 2026
A daily field note on Gao, Zhao, Muhtar et al.'s ROSE: cooperative elasticity for agentic RL rollouts on idle serving GPUs. Why the rollout-cost term in Cost-correct can be priced at the marginal-of-idle rate, and what that does to the inference-frontier threshold.
Field Note / May 17, 2026
A daily field note on Ma, Afzal, Eitzinger, and Wellein: power capping does not bite in memory-bound LLM decode on NVIDIA H200, while SM clock locking recovers up to 32% of decode energy. Why the standard energy lever moves the wrong knob, and what that does to the decode-cost term in Cost-correct.
Field Note / May 10, 2026
A daily field note on Lai, Feng, Teh, and Miao's VHG: three-party setter-solver-verifier self-play. Why the verifier's job in the production lifecycle just expanded from two places to three.
Field Note / May 6, 2026
A field note showing why verifier accept rate can dominate the other levers in cost-per-correct-answer economics.
Field Note / May 6, 2026
A field note on why the operational unit of LLM inference economics is shifting from cost-per-token to cost-per-correct-answer.
Field Note / May 3, 2026
A field note on token economics, runtime systems, model architecture, and the stack changes behind public LLM API price compression.
Definition / Undated
A post-training quantization method that protects high-signal weights using activation statistics.
Definition / Undated
A decoding-time acceleration technique that drafts tokens with a smaller model and verifies them with a larger model.
Definition / Undated
A family of sequence models that replace quadratic attention with selective state-space mechanisms.
Definition / Undated
The constrained compute layer for running inference close to sensors, robots, drones, and devices.
Definition / Undated
The study of cost-per-correct-answer and the verifier as a cost-and-value lever in reasoning systems.