How Speculative Decoding Speeds Up LLM Inference (2-3x, Lossless)

Generating text from a large language model is slow for a reason that has nothing to do with how smart the model is. Every new token requires a full forward pass over the entire network, one token at a time, and each of those passes is bottlenecked by memory bandwidth, not compute. The GPU spends most of its time streaming billions of weights out of memory; the actual matrix math finishes long before the next batch of weights has arrived. The arithmetic units sit mostly idle.

That idle time is the loophole. A forward pass that scores one token and a forward pass that scores five tokens cost almost the same wall-clock time, because both are dominated by loading weights, not by the math. Speculative decoding is the trick that turns that fact into a 2-3x speedup without changing the distribution the output is sampled from; under greedy decoding, not a single output token changes.
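A back-of-the-envelope roofline model makes the "one token costs about the same as five" claim concrete. This is a sketch under loud assumptions: the model size, weight precision, memory bandwidth, and peak FLOP rate below are round numbers picked for illustration, not the specs of any real GPU, and the model ignores KV-cache traffic and attention cost.

```python
# Toy roofline model of one decoding step.
# Every hardware number here is an illustrative assumption.
PARAMS = 7e9             # assumed model size (parameters)
BYTES_PER_PARAM = 2      # assumed 16-bit weights
BANDWIDTH = 2e12         # assumed memory bandwidth, bytes/second
PEAK_FLOPS = 4e14        # assumed peak throughput, FLOP/second


def step_time(tokens_scored: int) -> float:
    """Estimated seconds for one forward pass that scores `tokens_scored` tokens."""
    load_weights = PARAMS * BYTES_PER_PARAM / BANDWIDTH   # paid once per pass
    math = 2 * PARAMS * tokens_scored / PEAK_FLOPS        # ~2 FLOPs per weight per token
    # Weight streaming and math overlap, so the slower of the two sets the pace.
    return max(load_weights, math)


one = step_time(1)
five = step_time(5)
print(f"1 token:  {one * 1e3:.2f} ms")
print(f"5 tokens: {five * 1e3:.2f} ms")
```

With these assumed numbers both passes are limited by the time to stream the weights, so scoring five tokens takes no longer than scoring one.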

Draft, then verify

The idea is a guess-and-check loop between two models.

A small, fast draft model proposes the next K tokens. It runs autoregressively like any model, but it's tiny, so generating a short guess is cheap. The large target model then takes all K drafted tokens and verifies them in a single forward pass, scoring every position in parallel. It accepts the longest correct prefix. If it accepts h of the K tokens, it also gets the (h+1)-th token for free out of that same verification pass, because the pass already computed the distribution at that position. Then the loop repeats.
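The loop is easiest to see in the greedy case. The sketch below uses two hypothetical toy "models" (plain functions from a token list to the next token) rather than real networks, and it calls the target once per position where a real system would score all positions in one batched forward pass. The function names and the toy models are assumptions made for illustration.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one model call per new token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], k: int, max_new: int) -> List[int]:
    """Greedy speculative decoding with toy models."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target scores every position. A real system does this in ONE
        #    batched forward pass; the toy calls the function k + 1 times.
        target_choice = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix on which draft and target agree.
        h = 0
        while h < k and proposal[h] == target_choice[h]:
            h += 1
        # 4) Keep h drafted tokens plus the target's own token at position h
        #    (the correction on a mismatch, or the bonus token if all k passed).
        out.extend(proposal[:h] + [target_choice[h]])
    return out[len(prompt):len(prompt) + max_new]


# Hypothetical toy models over integer tokens.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_model(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 4.
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


prompt = [1, 2, 3]
fast = speculative_greedy(target_model, draft_model, prompt, k=3, max_new=20)
slow = greedy_decode(target_model, prompt, max_new=20)
assert fast == slow
```

Every iteration emits at least one token (the target's own choice at the first mismatch), so a bad draft slows things down but never stalls or changes the output.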

The payoff: you spend one expensive forward pass to advance potentially several tokens instead of one. The closer the draft tracks the target, the more tokens clear per pass.

If this sounds like speculative execution in a CPU pipeline, that's exactly the borrowed intuition — compute ahead on a prediction, keep the work if the prediction held, discard it if it didn't. The name comes from the same place.

Figure: a small draft model proposes several tokens and a large target model checks them in parallel.

Why the output is identical

Here's the part that trips people up: speculative decoding does not approximate the big model. It produces the exact same output distribution as plain autoregressive sampling from the target. It is a speed optimization, not a quality tradeoff, and the target model is never retrained or modified.

The mechanism is a modified rejection sampling scheme. Write p(x) for the target's probability of a drafted token x and q(x) for the draft's. The token is accepted with probability min(1, p(x)/q(x)): always when the target rates it at least as highly as the draft did, and with probability p(x)/q(x) otherwise. When a token is rejected, a replacement is sampled from the residual distribution, which is max(0, p - q) renormalized to sum to 1. Work through the math and the accepted-plus-resampled stream is provably distributed exactly as if you'd sampled from the target directly. Greedy decoding falls out as the special case where you simply check that the draft's argmax matches the target's.
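The claim can be checked on a toy vocabulary. The sketch below implements the accept-or-resample step for a single token and then computes, with exact fractions, the probability of each token coming out of that step. The three-token distributions are made up for illustration, and the function names are hypothetical rather than taken from any library.

```python
import random
from fractions import Fraction
from typing import List, Sequence


def residual(p: Sequence[Fraction], q: Sequence[Fraction]) -> List[Fraction]:
    """Normalized max(0, p - q): the distribution used after a rejection."""
    raw = [max(pi - qi, Fraction(0)) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p == q nothing is ever rejected, so the fallback value is never used.
    return [r / total for r in raw] if total else list(p)


def speculative_sample(p, q, rng: random.Random) -> int:
    """Draw one token: propose from draft q, accept or resample against target p."""
    x = rng.choices(range(len(q)), weights=[float(w) for w in q])[0]
    if rng.random() < min(1.0, float(p[x] / q[x])):
        return x  # accepted
    res = residual(p, q)
    return rng.choices(range(len(p)), weights=[float(w) for w in res])[0]


def exact_output_distribution(p, q) -> List[Fraction]:
    """Probability of each token leaving speculative_sample, computed exactly."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1 - sum(accept)                   # chance the draft token is rejected
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]


p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # target distribution
q = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]   # draft distribution
assert exact_output_distribution(p, q) == p
```

The assertion holds because the accepted mass at each token is min(p, q) and the rejected mass is redistributed exactly over max(0, p - q), and the two pieces add back up to p.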

One honest caveat: "lossless" is a property of the algorithm, not a guarantee about any given implementation. A 2025 paper, Batch Speculative Decoding Done Right, found that real systems — including ones from major labs — broke the guarantee by sampling the free bonus token from the draft distribution instead of the target distribution. The math is lossless; a careless implementation isn't. If correctness matters, test that your serving stack reproduces the target's outputs.

Where the draft comes from

The original method used a separate small model. Most of the gains since have come from cheaper ways to produce a draft.

Method	How the draft is produced	Needs a separate model?
Separate draft model (classic)	A small standalone model from the same family (e.g. a 1B drafting for a 70B)	Yes
Self-speculative / layer-skipping	The target model runs a reduced version of itself (skipping layers), sharing its KV cache	No
Medusa	Extra lightweight decoding heads bolted onto the model predict several future tokens at once	No (adds heads)
EAGLE / EAGLE-3	A small draft head autoregresses at the model's internal feature level, reusing the target's features	No (adds a head)
n-gram / prompt-lookup	Copies repeated spans straight from the prompt — no neural draft at all	No
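The last row of the table is simple enough to sketch in a few lines. The function below is one plausible way to do prompt lookup, written for illustration and not taken from any particular library; `max_ngram` and `k` are hypothetical parameter names.

```python
from typing import List, Sequence


def prompt_lookup_draft(tokens: Sequence[str], max_ngram: int = 3, k: int = 3) -> List[str]:
    """Propose up to k tokens by copying what followed an earlier copy of the current suffix."""
    # Prefer the longest suffix that has appeared before.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = list(tokens[-n:])
        # Scan earlier positions, most recent first.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                continuation = list(tokens[start + n:start + n + k])
                if continuation:
                    return continuation
    return []  # nothing repeated: no draft this step


context = "the cat sat on the mat . the cat".split()
print(prompt_lookup_draft(context))
```

Here the suffix "the cat" already occurred at the start of the context, so the draft copies the tokens that followed it there. The target still verifies the copied span, so a wrong copy costs time but not correctness.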

EAGLE-3 (arXiv 2503.01840, NeurIPS 2025) is the current state of the art at the time of writing. Earlier EAGLE versions draft at the feature level rather than predicting tokens directly; EAGLE-3 switches to direct token prediction and adds multi-layer feature fusion, and its acceptance rate keeps climbing as the draft head is given more training data. That scaling property is why it has become the default people reach for.

How much faster, really

Figure: speedup multipliers for different speculative decoding methods.

Setting	Reported speedup	Notes
Original papers (2022-2023)	2-3x	No quality loss, no target retraining
EAGLE-3 (reference)	up to ~6.5x	~4-4.5x on MT-bench / GSM8K, acceptance length ~6 tokens
EAGLE-3 in vLLM	~2-2.5x typical	Real serving gains run below the reference numbers
gpt-oss-120b + EAGLE3, H200 (Red Hat, 2026)	~10-21% throughput	Decode-heavy and code workloads; 3 draft tokens, ~2.07 mean acceptance length; held up to 200 concurrent requests

The single number that drives all of this is the acceptance rate — how often the draft guesses right. That's workload-dependent. Code and math are highly predictable and draft beautifully; open-ended chat is noisier and accepts shorter runs. Your mileage will track your traffic, not the headline figure.
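To see how strongly acceptance rate drives the result, it helps to use the idealized model from Leviathan et al.: assume each drafted token is accepted independently with the same probability α (a simplification, since real acceptance varies by position and context). With K drafted tokens, the expected number of tokens produced per target pass is then:

```latex
\mathbb{E}[\text{tokens per pass}]
  = \sum_{i=0}^{K} \alpha^{i}
  = \frac{1 - \alpha^{K+1}}{1 - \alpha}
```

With K = 3, an acceptance rate of α = 0.8 gives about 2.95 tokens per pass, while α = 0.5 gives about 1.88. The same draft length delivers very different gains depending on the traffic.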

When it doesn't help much

Speculative decoding is fundamentally a latency optimization — it shines on single streams and small batches, exactly the regime where the GPU is memory-bound and has spare compute to burn on verification. Crank the batch size up and that free compute disappears: the GPU is already saturated doing real work for many sequences at once, so verifying speculative tokens stops being free and the gains shrink.
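A toy cost model shows where the free compute runs out. As before, this is a sketch under stated assumptions: the hardware numbers are round illustrative values rather than real specs, K = 3 is an arbitrary choice, and KV-cache traffic and attention cost are ignored.

```python
# Toy roofline model applied to batching. All numbers are illustrative assumptions.
PARAMS = 7e9             # assumed model size (parameters)
BYTES_PER_PARAM = 2      # assumed 16-bit weights
BANDWIDTH = 2e12         # assumed memory bandwidth, bytes/second
PEAK_FLOPS = 4e14        # assumed peak throughput, FLOP/second
K = 3                    # drafted tokens per sequence; verification scores K + 1 positions


def pass_time(total_tokens: int) -> float:
    """Estimated seconds for one forward pass over `total_tokens` positions."""
    load = PARAMS * BYTES_PER_PARAM / BANDWIDTH
    math = 2 * PARAMS * total_tokens / PEAK_FLOPS
    return max(load, math)


def verification_overhead(batch_size: int) -> float:
    """Cost of a verify pass relative to a plain one-token-per-sequence pass."""
    return pass_time(batch_size * (K + 1)) / pass_time(batch_size)


for batch in (1, 8, 64, 256):
    print(f"batch {batch:>3}: verify pass costs "
          f"{verification_overhead(batch):.2f}x a normal pass")
```

Under these assumptions verification is free at batch sizes 1 and 8, costs about 1.28x at 64, and costs the full 4x at 256, at which point every verified position is paid for in compute.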

Batching also creates the ragged tensor problem. Different sequences in a batch accept different numbers of tokens, so the neat rectangular tensors GPUs love turn jagged, and handling that adds overhead. That's an active research area, not a solved one.

And it's not free to operate: you have to obtain or train a draft, serve it, tune K, and eat some wasted work whenever the draft guesses wrong.

Trying it

vLLM ships native support and configures it through speculative_config. A minimal EAGLE-3 setup looks like:

speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    "num_speculative_tokens": 3,
}
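One way to hand that dictionary to a server is as a JSON string on the command line. The sketch below only builds and prints the command; it does not import or run vLLM. The `--speculative-config` flag and the target checkpoint name are assumptions to check against the vLLM docs for your installed version, since option names have changed across releases.

```python
import json
import shlex

speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    "num_speculative_tokens": 3,
}

# Assumed target checkpoint; use the model the draft head was trained against.
target_model = "meta-llama/Llama-3.3-70B-Instruct"

# Assumed CLI shape: `vllm serve <model> --speculative-config '<json>'`.
argv = ["vllm", "serve", target_model,
        "--speculative-config", json.dumps(speculative_config)]
print(shlex.join(argv))
```

Building the argument with `json.dumps` and `shlex.join` avoids hand-quoting mistakes in the nested JSON.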

More speculative tokens aren't automatically better: past a point, deeper drafts get rejected more often and throughput drops, so 2-3 is a common sweet spot worth measuring around. If you don't want to manage a draft model at all, the ngram (prompt-lookup) method speculates by copying repeated spans from the prompt, which works surprisingly well on summarization and code-edit workloads where output echoes input.
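The diminishing return from a longer draft shows up even in the idealized speedup formula from Leviathan et al. The sketch below assumes every drafted token is accepted independently with probability `alpha` and that one draft step costs a fraction `c` of one target pass. Both values are illustrative assumptions, not measurements, and real acceptance is neither constant nor independent.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup for k drafted tokens (Leviathan et al.).

    alpha: per-token acceptance probability, assumed i.i.d.
    c:     cost of one draft step relative to one target pass
    """
    tokens_per_pass = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_pass = k * c + 1  # k draft steps plus one target pass
    return tokens_per_pass / cost_per_pass


ALPHA, C = 0.7, 0.1  # illustrative assumptions
speedups = {k: expected_speedup(ALPHA, k, C) for k in range(1, 11)}
best_k = max(speedups, key=speedups.get)
for k, s in speedups.items():
    print(f"K={k:>2}: {s:.2f}x" + ("  <- best" if k == best_k else ""))
```

In this toy setting the curve peaks at a small K and then declines, because each extra drafted token adds a fixed cost but is accepted less and less often. The right K for a real deployment still has to be measured on real traffic.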

Decision rule: if batches are small and latency is critical, this is close to a free win. For high-throughput batched serving, measure first; the gains may be modest or gone.

The whole trick reduces to one sentence: guess ahead with a cheap model, check with the expensive one, and let a rejection-sampling step make the guessing mathematically free in quality. The only thing left for you to decide is whether your batch sizes are small enough to collect the discount.

References

Leviathan, Kalman & Matias, Fast Inference from Transformers via Speculative Decoding (arXiv 2211.17192, ICML 2023)
Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (arXiv 2302.01318, DeepMind)
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (arXiv 2503.01840, NeurIPS 2025)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv 2401.10774)
Batch Speculative Decoding Done Right (arXiv 2510.22876)
vLLM Speculative Decoding docs
Red Hat Developer — Performance improvements with speculative decoding in vLLM for gpt-oss
Red Hat Developer — Fly Eagle(3) fly
BentoML LLM Inference Handbook — Speculative decoding
Baseten — A quick introduction to speculative decoding

