A daily field note on Ma, Afzal, Eitzinger, and Wellein (arXiv:2605.11999). Across GQA, Multi-head Latent Attention, Gated DeltaNet, and Mamba2 on NVIDIA H200, autoregressive decode draws only 137 to 300 W on a 700 W GPU, so no power cap ever triggers. The cap sits above the natural power ceiling of a memory-bound workload, which saturates HBM bandwidth rather than compute. SM clock locking is the lever actually on the critical path: it Pareto-dominates power capping and recovers up to 32% of decode energy at minimal throughput loss. The paper identifies three architecture-dependent DVFS behavioral classes and reports a prefill-decode energy crossover that halves total request energy relative to GQA at production batch sizes. The economic consequence is a tightened decode-cost term in Cost-correct and a shift in the inference-frontier threshold in favor of memory-efficient attention replacements.
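The arithmetic behind the note can be sketched in a few lines. This is a minimal illustration, not the paper's method: the 137 to 300 W draw and the 700 W rating come from the note above, while the function names, the 250 W and 500 W caps, and the clock-locked power and throughput figures are hypothetical values chosen only to show how the comparison works.

```python
# Figures stated in the field note.
DECODE_DRAW_W = (137.0, 300.0)  # observed decode power range on H200
GPU_RATED_W = 700.0             # rated board power


def cap_binds(natural_draw_w: float, cap_w: float) -> bool:
    """A power cap throttles only if the workload would otherwise exceed it."""
    return natural_draw_w > cap_w


def joules_per_token(power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts divided by tokens per second."""
    return power_w / tokens_per_s


def energy_saving(base_w: float, base_tps: float,
                  new_w: float, new_tps: float) -> float:
    """Fractional reduction in energy per token versus the baseline."""
    base = joules_per_token(base_w, base_tps)
    new = joules_per_token(new_w, new_tps)
    return 1.0 - new / base


# Even the highest observed decode draw stays under the rated limit,
# so a cap at that limit (or at a hypothetical 500 W) never engages.
peak_draw = max(DECODE_DRAW_W)
cap_at_rated = cap_binds(peak_draw, GPU_RATED_W)
cap_at_500 = cap_binds(peak_draw, 500.0)

# A cap only engages once it drops below the natural draw (hypothetical 250 W).
cap_at_250 = cap_binds(peak_draw, 250.0)

# HYPOTHETICAL clock-locked operating point: power falls 30%, throughput 2%.
# These numbers are illustrative and are not taken from the paper.
saving = energy_saving(base_w=300.0, base_tps=1000.0,
                       new_w=210.0, new_tps=980.0)

if __name__ == "__main__":
    print(f"cap binds at rated power: {cap_at_rated}")
    print(f"cap binds at 500 W: {cap_at_500}")
    print(f"cap binds at 250 W: {cap_at_250}")
    print(f"energy saved per token: {saving:.1%}")
```

The design point is that energy per token is a ratio: a lever helps only if it lowers power faster than it lowers throughput. In a bandwidth-bound regime, lowering the SM clock can cut power while throughput barely moves, which is why it can dominate a cap that never engages.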
Archive summary: Manu Bhardwaj writes public field notes on inference economics, verification economics, and AI systems engineering. Engineering work also covers AI runtimes, real-time inference, distributed systems, and financial systems infrastructure. Field Notes #1–3 form the May 2026 inference/verification economics sequence; Field Note #4 extends the archive into AI-system failure analysis.
Proof lives in the artifact.
The archive exposes claims, citation surfaces, raw source, and topic placement so the work can be inspected rather than believed.
artifact: alpha-asymmetry-2026
claim: verifier accept-rate dominates other levers
surfaces: article / citation / raw markdown
links: inference-economics -> verification-economics
Notes on systems and research work are welcome through the correspondence surface; include the problem, the constraint, and the artifact that should result. Selected lanes are currently open at /work-with-me.