Research Paper #1 in the AI systems engineering wedge. A closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax. Re-derives published throughput from five 2023–2025 systems papers (PagedAttention, Sarathi-Serve, DistServe, Splitwise, Mooncake) into a common frame, plots the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts, and solves the break-even surface between colocated and disaggregated architectures. The frontier partitions cleanly.
A daily field note on Gao, Zhao, Muhtar et al.'s ROSE. Cooperative elasticity for agentic RL rollouts on idle serving GPUs. Why the rollout-cost term in Cost-correct can be priced at the marginal-of-idle rate, and what that does to the inference-frontier threshold.
A daily field note on Ma, Afzal, Eitzinger, and Wellein. Power capping does not bite in memory-bound LLM decode on NVIDIA H200. SM clock locking recovers up to 32% of decode energy. Why the standard energy lever moves the wrong knob, and what that does to the decode-cost term in Cost-correct.
A field note on NVIDIA's Ising-Decoding release. Why the AI pre-decoder paired with correlated PyMatching stops improving logical error rate at distance 17 and above, and what to do about it.