AWQ quantization

AWQ (Activation-Aware Weight Quantization) is a post-training quantization method for large language models. It compresses model weights from 16-bit floating point to a 4-bit integer representation, reducing the weight memory footprint by roughly 75 percent while preserving most of the model’s quality.

Definition

AWQ analyzes activation patterns during a calibration pass over a small dataset and identifies the small subset of weights that carry disproportionate signal in the model’s output. It protects those weights by rescaling their channels before quantization, rather than by keeping them in higher precision. All weights are then quantized to INT4 with per-group scaling factors. The result is a model whose weights occupy roughly one quarter of the VRAM of the BF16 baseline, that runs faster on memory-bound inference workloads, and that loses less quality than naive INT4 quantization.
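
The "roughly one quarter" figure can be checked with a short calculation that also counts the per-group metadata. This is a minimal sketch: the group size of 128, the 16-bit scale, the 4-bit zero point, and the function names are illustrative assumptions, not values stated in this article.

```python
def quantized_weight_bytes(n_params, bits=4, group_size=128,
                           scale_bits=16, zero_bits=4):
    """Estimate storage for group-quantized weights, metadata included.

    group_size, scale_bits and zero_bits are assumed values for illustration.
    """
    n_groups = n_params / group_size
    total_bits = n_params * bits + n_groups * (scale_bits + zero_bits)
    return total_bits / 8


def bf16_weight_bytes(n_params):
    """BF16 stores each weight in 2 bytes."""
    return n_params * 2


n_params = 7_000_000_000  # hypothetical 7B-parameter model
ratio = quantized_weight_bytes(n_params) / bf16_weight_bytes(n_params)
print(f"quantized / BF16 weight size: {ratio:.3f}")
```

Under these assumptions the metadata adds a small overhead on top of the 4 bits per weight, so the ratio lands slightly above one quarter. The estimate covers weights only; activations and the KV cache are not reduced by weight quantization.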

Mechanism

The key observation behind AWQ is that not all weights matter equally. Activation magnitudes vary across channels, and the channels with large activations are sensitive to weight quantization error. AWQ uses an activation-aware scaling search to find a per-channel scale that minimizes the impact of quantization on the most important channels.
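
The scaling search can be written down compactly. The notation here is an assumption introduced for illustration: W is a layer's weight matrix, x an input activation vector, X the calibration activations, Q the 4-bit quantizer, s the vector of per-input-channel scales, and s_X the per-channel average activation magnitude. This follows the formulation commonly given for AWQ and is a sketch, not a complete statement of the method.

```latex
\begin{aligned}
y &= W x = \bigl(W\,\mathrm{diag}(s)\bigr)\bigl(\mathrm{diag}(s)^{-1} x\bigr), \\
s^{*} &= \arg\min_{s}\,
  \bigl\lVert Q\bigl(W\,\mathrm{diag}(s)\bigr)\,\mathrm{diag}(s)^{-1} X - W X \bigr\rVert, \\
s &= s_X^{\alpha}, \qquad \alpha \in [0, 1].
\end{aligned}
```

The first line shows that scaling is exact before rounding: multiplying a weight column by s and dividing the matching activation by s leaves the output unchanged. Only the quantizer Q breaks the identity, and a larger s on an important channel shrinks that channel's relative rounding error. Restricting s to a power of the activation magnitude reduces the search to a single scalar, alpha.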

Concretely, the method:

  1. Runs calibration data through the unquantized model and collects activation statistics per channel.
  2. Searches for a per-channel scale factor that, when applied to the weights before quantization (with the inverse scale applied to the corresponding activations), minimizes the output error introduced by 4-bit rounding.
  3. Quantizes the scaled weights to INT4 in groups, storing a scaling factor per group so that the weights can be dequantized at inference time.
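
The three steps above can be sketched for a single linear layer in NumPy. This is an illustrative toy under stated assumptions, not the reference implementation. The function names, the grid of 20 candidate exponents, the group size of 128, the asymmetric round-to-nearest quantizer, and the random test data are all hypothetical choices made for the example.

```python
import numpy as np


def quantize_groupwise(w, bits=4, group_size=128):
    """Asymmetric round-to-nearest quantization per group of input channels.

    Returns the dequantized weights so the rounding error can be measured.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    qmax = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / qmax   # per-group scale
    zero = np.round(-w_min / scale)                  # per-group zero point
    q = np.clip(np.round(g / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(out_f, in_f)


def awq_search_scale(w, x, n_grid=20, bits=4, group_size=128):
    """Grid-search a per-input-channel scale s = act_mag ** alpha.

    w: (out_features, in_features) weights; x: (n_samples, in_features)
    calibration activations.
    """
    # Step 1: per-channel activation statistics from calibration data.
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)
    ref = x @ w.T                                    # unquantized output
    best_err, best_s = np.inf, np.ones(w.shape[1])
    # Step 2: try candidate exponents, keep the one with least output error.
    for i in range(n_grid):
        alpha = i / n_grid
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())           # keep scales centered
        # Step 3: quantize scaled weights; dividing by s afterwards is
        # equivalent to applying the inverse scale to the activations.
        w_q = quantize_groupwise(w * s, bits, group_size) / s
        err = np.mean((x @ w_q.T - ref) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))
x = rng.normal(size=(32, 256))
x[:, :4] *= 30.0                                     # a few high-activation channels
s, awq_err = awq_search_scale(w, x)
rtn_err = np.mean((x @ quantize_groupwise(w).T - x @ w.T) ** 2)
print(f"plain round-to-nearest error: {rtn_err:.4f}, with scale search: {awq_err:.4f}")
```

Because alpha = 0 yields a scale of all ones, plain round-to-nearest is always one of the candidates, so the searched result is never worse than it on the calibration data. Production implementations additionally fold the inverse scale into the preceding operation and pack the INT4 values for the target kernel, which this sketch omits.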

Tradeoffs

Production status, 2026

AWQ is widely used for production LLM serving alongside GPTQ. Both are supported in vLLM, SGLang, and TensorRT-LLM. The throughput advantage of AWQ over baseline FP16 grows substantially when it is paired with Marlin kernels, a kernel family optimized for INT4 weight matrices on modern GPUs. According to [1], AWQ plus Marlin can deliver an order of magnitude more tokens per second than vanilla FP16 on the same hardware.

References

[1] VRLATech. LLM Quantization Explained: INT4, INT8, FP8, AWQ, and GPTQ in 2026.

