Paper / May 16, 2026
Research Paper #1 in the AI systems engineering wedge. A closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax. Re-derives published throughput from five 2023–2025 systems papers (PagedAttention, Sarathi-Serve, DistServe, Splitwise, Mooncake) into a common frame, plots the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts, and solves the break-even surface between colocated and disaggregated architectures. The frontier partitions cleanly.
Paper / May 11, 2026
Research Paper #2 in the verification-economics wedge. Per-verifier strictly proper elicitation does not compose. Pipeline miscalibration under any monotone Boolean composition rule exactly equals the within-instance verifier-disagreement covariance. A joint scoring-rule mechanism on the cross-product report space restores dominant-strategy incentive compatibility (DSIC) and achieves minimax-optimal regret of order sqrt((log K_1 + log K_2) / N). Per-component procurement records are insufficient evidence under the August 2026 EU AI Act high-risk obligations on composed pipelines.
Paper / May 15, 2026
Research Paper #2 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when the marginal compute dollar reduces cost-per-correct-answer faster on the inference channel than on the training channel. Calibrated against rStar-Math, DeepSeek-R1, and test-time-compute curves; matches the observed frontier-vs-commodity market split.
Paper / May 15, 2026
Research Paper #3 in the inference-economics wedge. Derives a closed-form threshold under the Cost-correct decomposition for when conditioning inference compute on a noisy difficulty estimate reduces cost-per-correct-answer: routing pays if and only if κ·Δ > γ, where κ is classifier calibration, Δ is workload heterogeneity, and γ is classifier overhead. Unifies five published patterns (speculative decoding, cascades, adaptive self-consistency, complexity-aware exploration, early exit) as one allocation rule, and calibrates against six deployed systems, with every operating point falling on the positive side of the threshold.
Paper / May 10, 2026
Original research paper. Posted-price markets for verification-as-a-service collapse to the worst verifier under unobservable quality. A scoring-rule mechanism on adversarially constructed grounded probes is dominant-strategy incentive-compatible, with matching minimax regret bounds of order sqrt(log K / N).
Field Note / May 11, 2026
A daily field note on Mei, Li, Chen, Pan, Wu, Miao, Jia, and Rashmi's Coral. Cost-efficient multi-LLM serving over heterogeneous cloud GPUs. Why the fragmentation of the LLM market and the heterogeneity of GPU supply make joint allocation the binding cost lever.
Field Note / May 16, 2026
A daily field note on Gao, Zhao, Muhtar et al.'s ROSE. Cooperative elasticity for agentic RL rollouts on idle serving GPUs. Why the rollout-cost term in Cost-correct can be priced at the marginal-of-idle rate, and what that does to the inference-frontier threshold.
Field Note / May 17, 2026
A daily field note on Ma, Afzal, Eitzinger, and Wellein. Power capping does not bite in memory-bound LLM decode on NVIDIA H200. SM clock locking recovers up to 32% of decode energy. Why the standard energy lever moves the wrong knob, and what that does to the decode-cost term in Cost-correct.
Field Note / May 10, 2026
A daily field note on Lai, Feng, Teh, and Miao's VHG. Three-party setter-solver-verifier self-play. Why the verifier's job in the production lifecycle just expanded from two places to three.
Field Note / May 7, 2026
A field note on NVIDIA's Ising-Decoding release. Why the AI pre-decoder paired with correlated PyMatching stops improving logical error rate at distance 17 and above, and what to do about it.
Field Note / May 6, 2026
A field note showing why verifier accept rate can dominate the other levers in cost-per-correct-answer economics.
Field Note / May 6, 2026
A field note on why the operational unit of LLM inference economics is shifting from cost-per-token to cost-per-correct-answer.
Field Note / May 3, 2026
A field note on token economics, runtime systems, model architecture, and the stack changes behind public LLM API price compression.
Definition / Undated
A post-training quantization method that protects high-signal weights using activation statistics.
Definition / Undated
A decoding-time acceleration technique that drafts tokens with a smaller model and verifies them with a larger model.
Definition / Undated
A family of sequence models that replace quadratic attention with selective state-space mechanisms.
Definition / Undated
The study of cost-per-correct-answer and the verifier as a cost-and-value lever in reasoning systems.