Speculative decoding

Speculative decoding is an inference-time technique that speeds up large language model serving, often by 2 to 3 times, without changing the output distribution of the target model. A small draft model proposes several tokens ahead, and the larger target model then verifies them in a single forward pass.

Definition

In standard autoregressive decoding, generating n tokens requires n sequential forward passes through the model. Each pass produces one token, conditioned on all previous tokens. The bottleneck is this sequential dependency: a pass cannot start until the previous pass has produced its token.

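Speculative decoding works around this dependency by adding a fast draft model. The draft model generates k candidate tokens autoregressively, which is cheap because the model is small. The target model then performs a single forward pass over those k tokens, computing in parallel the probability distributions it would have produced if generating them itself. The longest prefix of draft tokens that the target model accepts is kept in one shot. Under greedy decoding a draft token is accepted when it matches the target model's own choice. Under sampling it is accepted through a rejection-sampling test against the target model's distribution. At the first rejection the target model supplies the corrected token, and decoding continues from there.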
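The propose-then-verify loop described above can be sketched with toy models. This is a minimal illustration for the greedy case only, not a production implementation. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, a "model" is assumed to be any function from a token prefix to its greedy next token, and the target's single batched pass is simulated with a Python loop.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is a function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively (cheap passes).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model scores all k+1 positions. A real system does this in
    #    one batched forward pass; the loop here only simulates that pass.
    target_choices = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which draft and target agree.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choices[i]:
            break
        accepted.append(proposed[i])

    # 4. Add the target's own token: the correction at the first disagreement,
    #    or one extra token if every proposal was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted

def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> Tuple[List[int], int]:
    """Generate n_tokens and count how many target passes were needed."""
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
        target_passes += 1
    return out[:len(prefix) + n_tokens], target_passes

# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 3.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10

def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

tokens, passes = generate([0], toy_draft, toy_target, k=4, n_tokens=8)
print(tokens, passes)
```

Because every kept token is either confirmed or supplied by the target model, the output matches what the target model alone would have produced greedily, only with fewer target passes. Under sampling, step 3 would use a probabilistic acceptance test instead of an exact match.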

Why it works

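Modern LLM inference is typically memory-bandwidth-bound, especially at small batch sizes. A single forward pass of a large model spends most of its time loading parameters from HBM into the compute units, not actually computing. Scoring k tokens in one pass therefore costs little more than scoring one, because the same weights are loaded once either way. The draft model’s extra work is amortized over the tokens accepted per target pass.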
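A back-of-the-envelope calculation can make the bandwidth argument concrete. Every number below is an illustrative assumption rather than a measurement: a 7-billion-parameter model in 16-bit weights, 2 TB/s of HBM bandwidth, 300 TFLOP/s of peak compute, and about 2 FLOPs per parameter per token. The function `pass_time_ms` is hypothetical and uses a simple roofline-style estimate that ignores attention, KV-cache reads, and batching.

```python
# All hardware and model figures here are illustrative assumptions.
PARAMS = 7e9            # model parameters
BYTES_PER_PARAM = 2     # 16-bit weights
HBM_BANDWIDTH = 2e12    # bytes per second
PEAK_FLOPS = 3e14       # floating-point operations per second

def pass_time_ms(k: int) -> float:
    """Estimate the time of one forward pass that scores k tokens.

    The pass is assumed to take as long as the slower of two things:
    streaming all weights from HBM once, or doing the arithmetic for k tokens.
    """
    load_seconds = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH
    compute_seconds = 2 * PARAMS * k / PEAK_FLOPS
    return max(load_seconds, compute_seconds) * 1e3

for k in (1, 4, 8):
    print(k, round(pass_time_ms(k), 3))
```

Under these assumptions the weight load (about 7 ms) dominates for any small k, so the estimate is the same for 1 token and for 8. Compute would only become the limit at k in the low hundreds, far beyond typical draft lengths.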

When the draft model is good (high acceptance rate), most target-model passes yield multiple tokens. End-to-end latency drops and throughput rises, while output quality is unchanged.

Tradeoffs

Practical speedups

In production, end-to-end speedups of 2 to 3 times are routine when the draft model has a high acceptance rate (60 to 80 percent of draft tokens accepted). For long-context generation with predictable structure, gains can be higher.
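A simple model shows how acceptance rate could translate into speedup. It is a sketch under stated assumptions, not a prediction for any particular system: each draft token is assumed to be accepted independently with probability `alpha`, one draft pass is assumed to cost a fraction `draft_cost` of a target pass, and both function names are hypothetical. With k draft tokens per round, the expected number of tokens per target pass is the geometric sum 1 + alpha + ... + alpha^k.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    The target always contributes one token, so the value is at least 1.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One round costs k draft passes plus one target pass, measured in
    units of a single target pass.
    """
    round_cost = k * draft_cost + 1
    return expected_tokens_per_pass(alpha, k) / round_cost

for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

With k = 4 and a draft pass costing 5 percent of a target pass, this model gives roughly 1.9 times at 60 percent acceptance and 2.8 times at 80 percent, which is consistent with the range quoted above. Real acceptance is not independent across positions, so measured speedups may differ.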

