Latest surface / Field Note / May 17, 2026

A daily field note on Ma, Afzal, Eitzinger, and Wellein (arXiv:2605.11999). Across GQA, Multi-head Latent Attention, Gated DeltaNet, and Mamba2 on NVIDIA H200, autoregressive decode draws only 137 to 300 W on a 700 W GPU, so no power cap ever triggers: the cap sits above the natural ceiling of a memory-bound workload that saturates HBM bandwidth rather than compute. SM clock locking is the lever actually on the critical path, and it Pareto-dominates power capping, recovering up to 32% of decode energy at minimal throughput loss. The paper identifies three architecture-dependent DVFS behavioral classes and reports a prefill-decode energy crossover that halves total request energy relative to GQA at production batch sizes. The economic consequence is a tightened decode-cost term in Cost-correct and a shift in the inference-frontier threshold in favor of memory-efficient attention replacements.
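The arithmetic behind the power-cap claim can be sketched in a few lines. This is a minimal illustration, not the paper's method: the 137 to 300 W draw range and the 700 W limit come from the note above, while the function names, the throughput figures, and the clock-locked power value are hypothetical placeholders chosen only to show how the comparison works.

```python
# Sketch: why a power cap above the natural draw is a no-op, and how
# energy per token is compared between two operating points.
# All names here are illustrative, not from the paper.

DECODE_DRAW_W = (137.0, 300.0)  # decode power range reported in the note
GPU_LIMIT_W = 700.0             # GPU power limit reported in the note


def cap_binds(natural_draw_w: float, cap_w: float) -> bool:
    """A power cap throttles only if the workload would exceed it."""
    return natural_draw_w > cap_w


def energy_per_token_j(avg_power_w: float, tokens_per_s: float) -> float:
    """Joules per token = average watts / tokens per second."""
    return avg_power_w / tokens_per_s


def relative_saving(baseline_j: float, candidate_j: float) -> float:
    """Fraction of baseline energy saved by the candidate setting."""
    return 1.0 - candidate_j / baseline_j


# Even the top of the reported decode range never reaches the 700 W limit.
cap_is_active = cap_binds(max(DECODE_DRAW_W), GPU_LIMIT_W)

# HYPOTHETICAL operating points (assumed numbers, not measured results):
# locking the SM clock lowers power a lot while throughput barely moves,
# because decode is limited by memory bandwidth rather than compute.
baseline = energy_per_token_j(avg_power_w=300.0, tokens_per_s=1000.0)
locked = energy_per_token_j(avg_power_w=210.0, tokens_per_s=980.0)
saving = relative_saving(baseline, locked)

print(f"cap active: {cap_is_active}")
print(f"illustrative saving from clock locking: {saving:.1%}")
```

The point of the sketch is structural: a cap is a ceiling on power, so it changes nothing until the workload reaches it, whereas a clock setting changes the operating point directly.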

Read note / Raw source

Archive summary: Manu Bhardwaj writes public field notes on inference economics, verification economics, and AI systems engineering. Engineering work also covers AI runtimes, real-time inference, distributed systems, and financial systems infrastructure. Field Notes #1–3 form the May 2026 inference/verification economics sequence; Field Note #4 extends the archive into AI-system failure analysis.

Artifact specimen

Proof lives in the artifact.

The archive exposes claims, citation surfaces, raw source, and topic placement so the work can be inspected rather than believed.

artifact: alpha-asymmetry-2026
claim: verifier accept-rate dominates other levers
surfaces: article / citation / raw markdown
links: inference-economics -> verification-economics

Start here

Complete archive

Correspondence

Notes on systems and research work are welcome through the correspondence surface. Include the problem, the constraint, and the artifact that should result. Selected lanes are currently open at /work-with-me.