The Power-Cap Illusion. SM Clock Locking and the Real Decode Lever.
A Daily Field Note on Phase-Aware Energy in LLM Serving
Manu Bhardwaj. ifitsmanu.com. 17 May 2026. Last updated 17 May 2026. Version 1.0. Field Notes #8.
Daily field note. One fresh paper or post in the inference-economics or verification-economics wedge from the previous seven days, decomposed against what it shows and what it does not. Lighter scope than Field Notes #1 to #4. Same voice.
What it claims
Ma, Afzal, Eitzinger, and Wellein (2026), posted to arXiv on 12 May 2026, characterize the energy behavior of autoregressive LLM decode on NVIDIA H200. They cover four attention paradigms: GQA, Multi-head Latent Attention, Gated DeltaNet, and Mamba2. The headline negative finding is that power capping, the standard GPU energy lever in production LLM serving, is illusory in the phase that dominates production. Decode draws only 137 to 300 W on a 700 W GPU, so a cap set anywhere above that range never engages. The headline positive finding is that locking the SM clock, the lever that is actually on the critical path of memory-bound decode, Pareto-dominates power capping and recovers up to 32% of decode energy at minimal throughput loss.
The paper also identifies three architecture-dependent DVFS behavioral classes across the four attention variants, and reports a common energy pattern across the novel attention replacements. A heavier prefill cost is recouped by more efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
How it argues
The argument has three parts, each addressing a distinct way the standard energy story fails.
The first part. Bandwidth, not compute, is the binding constraint in decode. Memory-bound kernels saturate HBM bandwidth long before they approach the compute envelope. Power readings drop because the SMs are stalled waiting on memory. The requested cap never bites because the cap is set above the natural power ceiling of the workload. A “successful” power cap in this regime is a cap that merely watches a workload that was already running below it.
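The cap arithmetic can be made concrete with a minimal sketch. It assumes the simplest possible model, in which a cap only clips power when the workload's natural draw exceeds it. The function names and the example cap values are illustrative assumptions; only the 137 to 300 W decode range and the 700 W board limit come from the paper as summarized here.

```python
def cap_engages(natural_draw_w: float, cap_w: float) -> bool:
    """Return True only if the cap sits below what the workload would draw anyway."""
    return cap_w < natural_draw_w


def observed_power_w(natural_draw_w: float, cap_w: float) -> float:
    """Toy model: the GPU draws the lesser of its natural draw and the cap."""
    return min(natural_draw_w, cap_w)


DECODE_DRAW_RANGE_W = (137.0, 300.0)  # decode range reported for H200
BOARD_LIMIT_W = 700.0                 # H200 board power limit

# Hypothetical caps an operator might request (not taken from the paper).
peak_decode_draw = DECODE_DRAW_RANGE_W[1]
for cap in (600.0, 500.0, 400.0):
    print(cap, cap_engages(peak_decode_draw, cap), observed_power_w(peak_decode_draw, cap))
```

Under this toy model every cap in the loop leaves observed power at the workload's own 300 W peak, which is the sense in which the cap "worked" without doing anything.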
The second part. Firmware-initiated clock throttling. The H200 occasionally throttles SM clocks for thermal or reliability reasons that have nothing to do with the operator’s energy policy. These throttles produce throughput dips that any observer trying to attribute energy savings to the cap will misread as cap-induced. Controlled experiments require pinning the SM clock; otherwise the cap and the firmware throttle are confounded in the measurement.
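One way an operator might pin the SM clock for a controlled run is through `nvidia-smi`. The sketch below only builds the command lines and does not execute them. It uses the `-lgc` (`--lock-gpu-clocks`) and `-rgc` (`--reset-gpu-clocks`) options. The helper names, the GPU index, and the 1200 MHz value are hypothetical, and this is not necessarily the procedure the paper followed.

```python
from typing import List


def lock_sm_clock_cmd(gpu_index: int, mhz: int) -> List[str]:
    """Build the nvidia-smi call that pins the GPU (SM) clock to one frequency."""
    # -lgc takes MIN,MAX; passing the same value twice pins the clock.
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{mhz},{mhz}"]


def reset_sm_clock_cmd(gpu_index: int) -> List[str]:
    """Build the nvidia-smi call that releases the lock after the run."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


# Hypothetical target frequency; valid values depend on the device.
print(" ".join(lock_sm_clock_cmd(0, 1200)))
print(" ".join(reset_sm_clock_cmd(0)))
```

Locking clocks typically requires administrative privileges, and `nvidia-smi -q -d SUPPORTED_CLOCKS` lists the frequencies a given device accepts.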
The third part. SM clock locking is the correct lever. Once the bandwidth ceiling is named and the firmware confound is removed, the remaining knob is SM frequency. Lowering it slightly trades a small amount of throughput for a much larger reduction in dynamic power, because memory-bound decode underutilizes the per-cycle compute. The 32% energy recovery sits inside this region. Power capping, by contrast, never reaches it.
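The Pareto claim can be checked mechanically once each configuration is reduced to a throughput and an energy per token. The sketch below is a generic dominance test, not the paper's analysis code. The three operating points are invented purely to exercise the code and are not measurements from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperatingPoint:
    label: str
    tokens_per_s: float
    avg_power_w: float

    @property
    def joules_per_token(self) -> float:
        # Energy per token = average power / throughput (W / (tok/s) = J/tok).
        return self.avg_power_w / self.tokens_per_s


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = (a.tokens_per_s >= b.tokens_per_s
                and a.joules_per_token <= b.joules_per_token)
    better = (a.tokens_per_s > b.tokens_per_s
              or a.joules_per_token < b.joules_per_token)
    return no_worse and better


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical numbers for illustration only.
points = [
    OperatingPoint("uncapped", tokens_per_s=1000.0, avg_power_w=300.0),
    OperatingPoint("cap_250w", tokens_per_s=900.0, avg_power_w=250.0),
    OperatingPoint("lock_sm", tokens_per_s=970.0, avg_power_w=210.0),
]
print([p.label for p in pareto_front(points)])
```

With these made-up inputs, the clock-locked point beats the engaged cap on both throughput and joules per token, so the capped point falls off the front.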
The cross-architecture pattern is the second economically meaningful claim. Novel attention replacements (MLA, Gated DeltaNet, Mamba2) pay more energy at prefill than GQA, then recoup the cost across decode tokens. At small batch and short context the GQA baseline is competitive. At production batch and context the alternatives halve request energy. The crossover is real and it is architecture-class-specific.
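The crossover can be sketched with a linear request-energy model: a fixed prefill cost plus a constant decode cost per output token. That constancy is a simplifying assumption, since per-token decode cost can grow with context length. The function names and every number below are hypothetical and are not taken from the paper.

```python
import math
from typing import Optional


def request_energy_j(prefill_j: float, decode_j_per_token: float,
                     output_tokens: int) -> float:
    """Linear model: total request energy = prefill + tokens * per-token decode."""
    return prefill_j + decode_j_per_token * output_tokens


def breakeven_output_tokens(base_prefill_j: float, base_decode_j: float,
                            alt_prefill_j: float, alt_decode_j: float) -> Optional[int]:
    """Smallest output length at which the alternative uses no more energy
    than the baseline, or None if it never catches up."""
    extra_prefill = alt_prefill_j - base_prefill_j
    decode_saving = base_decode_j - alt_decode_j
    if extra_prefill <= 0:
        return 0      # alternative is not behind even at zero output tokens
    if decode_saving <= 0:
        return None   # no per-token saving to repay the extra prefill
    return math.ceil(extra_prefill / decode_saving)


# Hypothetical inputs: baseline (GQA-like) vs. a heavier-prefill alternative.
n_star = breakeven_output_tokens(100.0, 0.5, 160.0, 0.25)
print(n_star)
print(request_energy_j(100.0, 0.5, n_star), request_energy_j(160.0, 0.25, n_star))
```

Below the break-even length the baseline is cheaper. Above it, the alternative's advantage widens with every additional decoded token.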
What is interesting
Three things.
First, this is a clean separation of “what knob does the operator turn” from “what knob is on the critical path.” Production LLM serving currently turns the wrong knob, and the measurement infrastructure for energy reporting agrees that the knob worked, because power did fall. The fact that it fell because of the workload, not because of the cap, is invisible without the controlled measurement protocol the paper documents.
Second, the result is the energy-domain analogue of the Cost-correct claim in The Cost of Being Right. Cost-correct says that pricing inference against the wrong cost basis hides the relevant economic choice. Here, pricing energy against the cap line hides the lever that actually controls energy. The structural shape of the error is the same. A measurement that looks correct because it sums to the expected number conceals the fact that the operator’s actual degree of freedom is not the one being recorded.
Third, the prefill-decode energy crossover is a constraint on the inference frontier, not just an operating tip. If MLA, Gated DeltaNet, and Mamba2 halve total request energy at production batch sizes by making decode cheaper, the training-versus-test-time threshold derived in The Inference-Time Compute Frontier shifts in favor of test-time allocation for those architectures, because the per-test-token energy term shrinks while the training-energy term does not.
What is missing
Four things.
The energy numbers are reported on H200. The paper does not extend the measurement to H100, B200, or GB300, where bandwidth ratios, HBM stack sizes, and DVFS firmware behavior differ. The 32% recovery is conditional on the H200 frequency-voltage curve and the H200 firmware throttle behavior. Whether the same lever delivers the same headroom on B200 is open.
The 32% is decode-only. Prefill is mostly compute-bound, and the cap may well do real work there. A complete operator policy is a per-phase policy. Clock-lock in decode, possibly let the cap engage during prefill. The paper hints at this but does not deliver it.
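A per-phase policy of the kind described here could be written down as plain data. The sketch below is one hypothetical shape for it. The class name, field names, and all numeric values are assumptions, and the paper does not publish such a policy.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PhasePolicy:
    lock_sm_clock_mhz: Optional[int]  # None means leave the SM clock unlocked
    power_cap_w: Optional[int]        # None means leave the default power limit


# Hypothetical values chosen only to show the structure.
PER_PHASE_POLICY: Dict[str, PhasePolicy] = {
    # Prefill is mostly compute-bound, so a cap may do real work there.
    "prefill": PhasePolicy(lock_sm_clock_mhz=None, power_cap_w=500),
    # Decode is memory-bound, so the SM clock is the lever.
    "decode": PhasePolicy(lock_sm_clock_mhz=1200, power_cap_w=None),
}


def policy_for(phase: str) -> PhasePolicy:
    """Look up the policy for a serving phase, rejecting unknown phases."""
    try:
        return PER_PHASE_POLICY[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None


print(policy_for("decode"))
```

Keeping the two levers as separate optional fields makes it explicit which one is active in each phase, which is the distinction an aggregate cap setting hides.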
The work prices only operator-side electrical energy. Capex and embodied-energy terms are out of scope. The economic claim should land at “operator energy per decode token,” not at “total energy of inference,” and the paper is honest about this only in passing.
Finally, the three DVFS behavioral classes are not named at the level of generality that an operator would need to predict which class a new attention variant falls into without first measuring it. A taxonomy that bound class to architectural property would extend the result from observational to predictive. The paper documents three classes; it does not derive them.
Why it matters now
Production LLM serving is the largest single new electricity load arriving on the cloud grid, and its energy controls are being audited by the same procurement and compliance teams that audit verification. If those teams treat power-cap compliance as a proxy for energy efficiency, they are reading the wrong field on the invoice. The lever is SM frequency. The right control plane reports decode-phase clock-locking policy, not aggregate cap settings.
For inference-economics work the result tightens the decode-cost term in Cost-correct by a measurable factor and shifts the architectural cost asymmetry in favor of memory-efficient attention replacements at production batch. For verification-economics work it forecasts that audit ledgers will need to record clock-locking policy alongside cap configuration once the standards bodies catch up to the measurement.
The pattern repeats. The lever the operator pulls is not the lever that moves the cost. Naming the right lever, and pricing it, is what this field-note archive exists for.
Source
Ma, Afzal, Eitzinger, and Wellein (2026). arXiv:2605.11999, posted 12 May 2026.
Related field notes
- Field Notes #2. The Cost of Being Right. Verification Economics in 2026.
- Research Paper #1. The Inference-Time Compute Frontier.
Cite this article
@misc{bhardwaj2026powercapillusion,
author = {Bhardwaj, Manu},
title = {The Power-Cap Illusion: SM Clock Locking and the Real Decode Lever},
year = {2026},
month = {May},
url = {https://ifitsmanu.com/papers/the-power-cap-illusion},
note = {Field note. Field Notes \#8. Daily review of arXiv:2605.11999. Version 1.0.}
}