Speculative decoding #
Speculative decoding is an inference-time technique that accelerates large language model generation, typically by about 2 to 3 times, without changing the output of the target model. A small draft model proposes several tokens ahead, and the larger target model verifies them in a single forward pass.
Definition #
In standard autoregressive decoding, generating n tokens requires n sequential forward passes through the model. Each pass produces one token, conditioned on all previous tokens. The bottleneck is the sequential nature of the dependency.
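Speculative decoding works around this dependency with a fast draft model. The draft model generates k candidate tokens autoregressively, which is cheap because the model is small. The target model then performs a single forward pass over those k tokens, computing the probability distributions it would have produced had it generated them itself. Under greedy decoding, the longest prefix on which the target model agrees with the draft is accepted in one shot. Under sampling, each drafted token passes or fails a rejection-sampling test, so the output still follows the target model's distribution. At the first rejected position the target model's own token replaces the draft's, and decoding continues from there. If all k tokens are accepted, the same pass also yields one extra token.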
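The accept-and-correct loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not any library's implementation: the names `speculative_step`, `generate`, `greedy`, `toy_draft`, and `toy_target` are hypothetical, and each "model" is just a function that returns the next token for a prefix. A real system would obtain all k + 1 target predictions from one batched forward pass; the loop below only simulates that.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively (k cheap passes).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model scores every position. In a real implementation these
    #    k + 1 predictions come from ONE forward pass; here it is simulated.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which draft and target agree.
    accepted: List[Token] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])

    # 4. Add the target's own token: the correction at the first disagreement,
    #    or a bonus token when all k proposals were accepted.
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> List[Token]:
    """Generate n_new tokens with repeated speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + n_new]


def greedy(prefix: List[Token], target: NextToken, n_new: int) -> List[Token]:
    """Baseline: plain autoregressive greedy decoding with the target only."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Toy models: the target counts upward modulo 10; the draft does the same
# but makes a mistake whenever the previous token is 3.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target, k=4))  # [1, 2, 3, 4]
```

In the toy run, the draft's fourth proposal is wrong, so three drafted tokens are accepted and the target supplies the fourth. Because every emitted token is either confirmed or produced by the target, `generate` returns the same sequence as `greedy`.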
Why it works #
Autoregressive decoding at small batch sizes is memory-bandwidth-bound. A single forward pass of a large model spends most of its time loading parameters from HBM into the compute units, not actually computing. Because the weights are loaded once per pass regardless of how many tokens are scored, verifying k tokens in one pass costs little more than scoring one. The draft model’s work is amortized over the tokens accepted in each pass.
When the draft model is good (high acceptance rate), most target-model passes yield several tokens. End-to-end latency drops and throughput rises, while output quality is unchanged.
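A back-of-the-envelope roofline estimate makes the "nearly free" claim concrete. The sketch below is an assumption-laden approximation: it models a pass as the larger of weight-loading time and arithmetic time, ignores KV-cache traffic and activations, and uses hypothetical round numbers for model size, bandwidth, and peak compute rather than the specifications of any particular accelerator.

```python
def pass_time_seconds(n_tokens: int, n_params: float, bytes_per_param: float,
                      mem_bandwidth: float, peak_flops: float) -> float:
    """Rough time for one forward pass that scores n_tokens positions."""
    # Weights are read from memory once per pass, however many tokens are scored.
    load_time = n_params * bytes_per_param / mem_bandwidth
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    compute_time = 2 * n_params * n_tokens / peak_flops
    # Simplification: the slower of the two dominates.
    return max(load_time, compute_time)


# Hypothetical, illustrative values (not measurements of real hardware).
PARAMS = 70e9           # 70B-parameter target model
BYTES_PER_PARAM = 2     # 16-bit weights
MEM_BANDWIDTH = 3e12    # bytes per second
PEAK_FLOPS = 1e15       # floating-point operations per second

for n in (1, 4, 8, 1000):
    t = pass_time_seconds(n, PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH, PEAK_FLOPS)
    print(f"{n:>5} tokens: {t * 1e3:.1f} ms")
```

Under these assumptions, scoring 1, 4, or 8 tokens takes the same time because weight loading dominates. Only when a pass scores hundreds of tokens does arithmetic become the limit, which is also why the benefit shrinks at large batch sizes.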
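The link between acceptance rate and tokens per pass can be made quantitative with a simple model. The sketch below assumes, as a simplification, that each drafted token is accepted independently with the same probability `alpha`. Under that assumption the expected number of tokens per target pass is a geometric sum, 1 + alpha + ... + alpha^k. The function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the pass always contributes one token of its own
    (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.2, 0.6, 0.8):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per pass")
```

With k = 4, this model gives about 2.31 tokens per pass at alpha = 0.6 and about 3.36 at alpha = 0.8. Real acceptance is not independent across positions, so treat the numbers as rough guidance.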
Tradeoffs #
- Pros. No quality loss versus the target model. Compounds with quantization and other runtime tricks. Production-ready in vLLM, TensorRT-LLM, and SGLang.
- Cons. Requires a compatible draft model (smaller, same tokenizer, similar distribution). Acceptance rate dominates effective speedup; a poor draft model gives marginal gains.
Practical speedups #
In production, end-to-end speedups of 2 to 3 times are routine when the draft model has a high acceptance rate (60 to 80 percent). For long-context generation with predictable structure, gains can be higher.
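To see how acceptance rate and draft cost combine into a speedup figure, the estimate below divides expected tokens per round by the cost of that round. It is a rough sketch under stated assumptions: drafted tokens are accepted independently with probability `alpha`, one draft-model pass costs a fixed fraction `draft_cost` of a target pass (a hypothetical parameter), and scheduling and sampling overheads are ignored. The function names are illustrative.

```python
def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    alpha:      per-token acceptance probability (assumed independent)
    k:          number of tokens drafted per round
    draft_cost: cost of one draft pass relative to one target pass
    """
    if alpha == 1.0:
        tokens = float(k + 1)
    else:
        tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # One round costs k draft passes plus one target pass.
    round_cost = k * draft_cost + 1.0
    return tokens / round_cost


def best_draft_length(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k that maximizes the estimated speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: estimated_speedup(alpha, k, draft_cost))


for alpha in (0.2, 0.6, 0.8):
    s = estimated_speedup(alpha, k=4, draft_cost=0.05)
    print(f"alpha={alpha}: about {s:.2f}x at k=4")
```

With a draft pass assumed to cost 5 percent of a target pass and k = 4, the estimate is roughly 1.9x at alpha = 0.6 and 2.8x at alpha = 0.8, but only about 1.04x at alpha = 0.2. The model is too simple to predict production numbers, but it shows why a poor draft model yields marginal gains.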
Related #
- AWQ quantization. Reduces memory pressure; compounds with speculative decoding.
- The Inference Stack in 2026. Section 3 explains the runtime stack including continuous batching, PagedAttention, and speculative decoding.