AWQ quantization #
AWQ (Activation-Aware Weight Quantization) is a post-training quantization method for large language models. It compresses model weights from 16-bit floating point to a 4-bit integer representation, reducing the weight memory footprint by roughly 75 percent while preserving most of the model’s quality.
Definition #
AWQ analyzes activation patterns during a calibration pass over a small dataset and identifies the small subset of weight channels that carry disproportionate signal in the model’s output. It protects those channels by scaling them up before quantization, not by keeping them in higher precision. All weights are then quantized to INT4 with per-group scaling factors, so the stored format stays uniform. The result is a model whose weights fit in roughly one quarter of the VRAM of the BF16 baseline (KV cache and activations are not compressed), that runs faster on memory-bound inference workloads, and that degrades quality less than naive INT4 quantization.
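The "one quarter" figure and the memory-bound speedup can be checked with back-of-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the 7e9 parameter count and the 1 TB/s memory bandwidth are made-up illustrative values, per-group scale overhead is ignored, and the decode ceiling assumes batch size 1 where every generated token streams all weights from memory once.

```python
GIB = 2**30

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8

def decode_ceiling(n_params: float, bits_per_weight: float, bandwidth: float) -> float:
    """Upper bound on batch-1 tokens/s when each token reads every weight once."""
    return bandwidth / weight_bytes(n_params, bits_per_weight)

N_PARAMS = 7e9     # assumed model size (illustrative)
BANDWIDTH = 1e12   # assumed memory bandwidth in bytes/s (illustrative)

bf16_bytes = weight_bytes(N_PARAMS, 16)
int4_bytes = weight_bytes(N_PARAMS, 4)

# Fraction of weight memory saved, before per-group metadata is counted.
reduction = 1 - int4_bytes / bf16_bytes

# Ratio of the bandwidth-bound decode ceilings (INT4 over BF16).
ceiling_ratio = (decode_ceiling(N_PARAMS, 4, BANDWIDTH)
                 / decode_ceiling(N_PARAMS, 16, BANDWIDTH))

print(f"BF16 weights: {bf16_bytes / GIB:.1f} GiB")
print(f"INT4 weights: {int4_bytes / GIB:.1f} GiB")
print(f"Weight memory saved: {reduction:.0%}")
print(f"Bandwidth-bound decode ceiling ratio: {ceiling_ratio:.1f}x")
```

Under these assumptions the weight-traffic ceiling improves by the same factor as the weight size. Measured speedups depend on the kernel, batch size, and how much of the workload is actually memory-bound.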
Mechanism #
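The key observation behind AWQ is that not all weights matter equally. Activation magnitudes vary across input channels, and weights in channels with large activations are the most sensitive to quantization error, because any rounding error in those weights is multiplied by a large activation. Multiplying a weight channel by a scale and dividing the matching activation channel by the same scale leaves the full-precision output unchanged, but it shrinks the relative rounding error on that channel. AWQ uses an activation-aware scaling search to find a per-channel scale that minimizes the impact of quantization on the most important channels.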
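A toy illustration of the scaling idea, using a single weight row treated as one quantization group. This is a minimal sketch, not the AWQ implementation: `fake_quant` is a hypothetical helper that rounds to 4 bits and returns dequantized floats, the salient channel (index 3) and its scale (2.0) are picked by hand, and all data is random.

```python
import random

def fake_quant(group, bits=4):
    """Asymmetric round-to-nearest over one group, returned as dequantized floats."""
    lo, hi = min(group), max(group)
    step = (hi - lo) / (2**bits - 1) or 1.0  # guard against a constant group
    return [lo + round((w - lo) / step) * step for w in group]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

random.seed(0)
n = 64
w = [random.gauss(0, 1) for _ in range(n)]  # one row of weights
x = [random.gauss(0, 1) for _ in range(n)]  # one activation vector
x[3] *= 50.0                                # assumed salient channel: large activation

# Hand-picked scales: boost only the salient channel.
s = [1.0] * n
s[3] = 2.0

w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]

y_ref = dot(w, x)
y_equiv = dot(w_scaled, x_scaled)  # same output before any quantization

# Output error with plain rounding vs. rounding after scaling.
err_plain = abs(dot(fake_quant(w), x) - y_ref)
err_scaled = abs(dot(fake_quant(w_scaled), x_scaled) - y_ref)

print(f"full precision: {y_ref:.4f}, scaled equivalent: {y_equiv:.4f}")
print(f"output error, plain INT4:  {err_plain:.4f}")
print(f"output error, scaled INT4: {err_scaled:.4f}")
```

Scaling helps when the boosted weight does not become the new maximum of its group: the quantization step stays about the same, while the error contributed by that channel is divided by its scale. On a single random draw like this one the scaled error is not guaranteed to be lower, which is why AWQ searches for scales against calibration data instead of fixing them by hand.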
Concretely, the method:
- Runs calibration data through the unquantized model and collects activation statistics per channel.
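- Searches for a per-channel scale factor that, when applied to the weights before quantization, minimizes the layer’s output error after 4-bit rounding. In the original method this is a grid search over a single exponent applied to the per-channel activation magnitude, not a closed-form solve.
- Quantizes the scaled weights to INT4 in groups, storing quantization parameters per group that are used to dequantize the weights at inference time.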
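The three steps above can be sketched for a single linear layer as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the helper names (`fake_quant`, `output_error`, `search_scales`) are hypothetical, the scale is taken as the mean absolute activation raised to an exponent `alpha` searched on a grid over [0, 1], the data is random, and details found in real toolchains (scale normalization, clipping search, fusing the inverse scale into the previous operator, packed INT4 storage) are left out.

```python
import random

def fake_quant(row, group_size, bits=4):
    """Per-group asymmetric round-to-nearest; returns dequantized floats."""
    out = []
    levels = 2**bits - 1
    for g in range(0, len(row), group_size):
        group = row[g:g + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / levels or 1.0  # guard against a constant group
        out.extend(lo + round((w - lo) / step) * step for w in group)
    return out

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def output_error(W, X, scales, group_size):
    """Squared output error of the scaled-then-quantized layer over samples X."""
    Wq = [fake_quant([w * s for w, s in zip(row, scales)], group_size) for row in W]
    err = 0.0
    for x in X:
        ref = matvec(W, x)
        got = matvec(Wq, [xi / s for xi, s in zip(x, scales)])
        err += sum((a - b) ** 2 for a, b in zip(ref, got))
    return err

def search_scales(W, X, group_size, grid=20):
    # Step 1: per-input-channel activation statistic (mean absolute value).
    n_in = len(W[0])
    act = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    # Step 2: grid search over alpha, with scales = act ** alpha.
    best = None
    for k in range(grid + 1):
        alpha = k / grid
        scales = [max(a, 1e-8) ** alpha for a in act]
        err = output_error(W, X, scales, group_size)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

random.seed(0)
n_out, n_in, n_samples, group_size = 8, 32, 16, 8
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
# Synthetic calibration data: every 8th input channel has 10x larger activations.
gain = [10.0 if j % 8 == 0 else 1.0 for j in range(n_in)]
X = [[random.gauss(0, 1) * gain[j] for j in range(n_in)] for _ in range(n_samples)]

best_err, best_alpha, scales = search_scales(W, X, group_size)
baseline_err = output_error(W, X, [1.0] * n_in, group_size)

# Step 3: quantize the scaled weights with the chosen scales.
W_q = [fake_quant([w * s for w, s in zip(row, scales)], group_size) for row in W]

print(f"alpha={best_alpha:.2f}  error={best_err:.4f}  no-scaling error={baseline_err:.4f}")
```

Because `alpha = 0` gives all-ones scales and is part of the grid, the searched result is never worse than plain round-to-nearest on the calibration samples. How well that carries over to unseen inputs depends on how representative the calibration data is.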
Tradeoffs #
- Pros. Roughly 75 percent reduction in weight VRAM. Faster inference on memory-bound workloads. Quality degradation is acceptable for most production tasks (summarization, classification, code completion).
- Cons. Quality degradation is noticeable on hard reasoning and math benchmarks. Requires a representative calibration dataset. Choice of group size (commonly 128) trades quality for memory: smaller groups follow the weight distribution more closely but store more per-group quantization metadata.
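The group-size tradeoff can be made concrete by counting the metadata stored per group. The sketch below assumes one 16-bit scale and one 4-bit zero point per group, which is a common layout but an assumption here, not something the text above specifies. Other formats will give slightly different numbers.

```python
def effective_bits(group_size: int, weight_bits: int = 4,
                   scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Average stored bits per weight, including per-group scale and zero point."""
    return weight_bits + (scale_bits + zero_bits) / group_size

for group_size in (32, 64, 128, 256):
    bits = effective_bits(group_size)
    saved = 1 - bits / 16  # relative to a 16-bit baseline
    print(f"group size {group_size:>3}: {bits:.3f} bits/weight, {saved:.1%} saved")
```

Under these assumptions a group size of 128 costs about 4.16 bits per weight, so the saving over a 16-bit baseline is close to 74 percent, a little under the nominal 75. Going down to 32 raises the cost to about 4.63 bits per weight in exchange for finer-grained scaling.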
Production status, 2026 #
AWQ is widely used for production LLM serving alongside GPTQ. Both are supported in vLLM, SGLang, and TensorRT-LLM. The throughput advantage of AWQ over baseline FP16 grows substantially when paired with Marlin kernels, a kernel family optimized for multiplying FP16 activations by INT4 weight matrices on modern GPUs. According to [1], AWQ plus Marlin can deliver an order of magnitude more tokens per second than vanilla FP16 on the same hardware; the gain in any given deployment depends on the model, batch size, and GPU.
Related #
- Speculative decoding. Runtime technique that compounds with quantization for end-to-end throughput gains.
- The Inference Stack in 2026. Field note on the broader inference cost collapse.
References #
[1] VRLATech. LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026.