What Speculative Decoding Actually Does

LLM inference is slow because it's autoregressive: each token is generated sequentially, conditioned on every previous token. You can't parallelize the generation itself, but you can parallelize the verification of candidate tokens. Speculative decoding exploits this asymmetry. A small, fast draft model proposes several candidate tokens, still one at a time but at a fraction of the target model's cost per token. The large target model then scores all candidates in a single forward pass, which is cheaper than generating them one at a time. Candidates are accepted in order for as long as they are consistent with the target model's distribution; the first one that fails is discarded along with everything after it, and the target model's own token takes its place. In the best case, you get 3–4 tokens accepted per draft-verify cycle instead of 1, and decoding speed improves roughly in proportion, less the cost of running the draft model. In the worst case, most drafts are rejected and you pay the overhead of running both models, which is slower than the target model alone.
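
To make the accept-and-replace step concrete, here is a minimal sketch of verification for the greedy (temperature 0) case only. The function name `verify_greedy` and its plain lists of token IDs are assumptions for illustration, not any library's API. Sampling-based decoding replaces the exact-match check below with a probabilistic acceptance test.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one draft-verify cycle under greedy decoding.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens, the target model's greedy choice at each
                   drafted position plus one position past the last draft,
                   all read off a single target forward pass.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    emitted: List[int] = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the target's token, discard the remaining drafts.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every draft was accepted, so the target's extra prediction comes for free.
    emitted.append(target_tokens[-1])
    return emitted


# Two drafts match, the third does not: the cycle emits three tokens.
print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))  # -> [5, 6, 9]
```

Note that even a cycle in which every draft is rejected still emits one token, so the output always advances; the cost in that case is the wasted draft work.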

Choosing Your Draft Model

Draft model selection is the highest-leverage decision in a speculative decoding deployment. The draft model needs to be fast (2–5% of the target model's parameter count is typical), and its output distribution needs to be close enough to the target's that most drafts are accepted. The standard approach is to use a smaller model from the same family: Llama 3.2 1B as a draft for Llama 3.3 70B, for example (about 1.4% of the parameters, slightly under the typical range). Same tokenizer, same pre-training distribution, architecturally similar. Acceptance rates of 70–85% are achievable. Cross-family drafting (using a small model from a different family as the draft) typically gives acceptance rates of 40–60%, which often isn't worth the complexity. Same family, same tokenizer, much smaller: that's the recipe that actually works.
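
To see why the gap between those acceptance ranges matters so much, it helps to turn an acceptance rate into expected tokens per cycle. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification (real acceptance varies by position and content). Under that assumption the expected output of a cycle with draft length k is (1 - a^(k+1)) / (1 - a). The function name is hypothetical.

```python
def expected_tokens_per_cycle(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per draft-verify cycle.

    Assumes every drafted token is accepted independently with probability
    `acceptance` (a simplifying assumption). The count includes the one
    token the target model always contributes at the end of a cycle.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        return float(draft_len + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Draft length 4: a same-family-like rate versus a cross-family-like rate.
print(round(expected_tokens_per_cycle(0.8, 4), 2))  # about 3.36
print(round(expected_tokens_per_cycle(0.5, 4), 2))  # about 1.94
```

Under this model, dropping from 80% to 50% acceptance cuts the tokens per cycle by more than 40% while the cost of each cycle stays the same.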

Batch Size and GPU Utilization

Speculative decoding's gains are batch-size dependent in a counterintuitive way. At batch size 1 (single-user inference), gains are highest: 2–4× on typical tasks. At large batch sizes (32+), gains diminish significantly, sometimes to near zero. The reason: at small batch sizes, decoding is memory-bandwidth-bound. Each step spends most of its time streaming model weights from GPU memory while the arithmetic units sit largely idle, so scoring several draft tokens in one pass costs little more than generating a single token. Large batches already saturate GPU compute, and once that happens every extra token in the verification pass costs real time. Speculative decoding's value comes from spending otherwise idle compute on verification, which only works when batch sizes are small. For interactive applications (chatbots, copilots), where batch sizes are inherently small and latency matters most, speculative decoding is almost always worth deploying. For offline batch processing, it's rarely worth the operational complexity.
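
A rough cost model shows how the batch-size effect plays out. This is a back-of-the-envelope sketch, not a measurement: it assumes independent per-token acceptance, and it expresses both the draft step and the verification pass as multiples of one ordinary target decode step. All parameter names and the example values are assumptions chosen for illustration.

```python
def estimated_speedup(acceptance: float, draft_len: int,
                      draft_cost: float, verify_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    acceptance:  per-token acceptance probability, assumed independent.
    draft_cost:  cost of one draft-model step, relative to one target decode step.
    verify_cost: cost of the target's verification pass, relative to one target
                 decode step. Close to 1 when decoding is memory-bandwidth-bound;
                 closer to draft_len + 1 when the GPU is already compute-bound.
    """
    if not 0.0 <= acceptance < 1.0:
        raise ValueError("acceptance must be in [0, 1)")
    expected_tokens = (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)
    cycle_cost = draft_len * draft_cost + verify_cost
    return expected_tokens / cycle_cost


# Same draft model and acceptance rate; only the verification cost changes.
small_batch = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=1.0)
large_batch = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=5.0)
print(round(small_batch, 2), round(large_batch, 2))  # about 2.8 and about 0.65
```

With verification nearly free, the model predicts a speedup inside the 2–4× range quoted above. When verification costs as much as five separate decode steps, the same configuration comes out slower than not speculating at all.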

"Speculative decoding is a latency optimization, not a throughput optimization. If your bottleneck is throughput on large batches, look elsewhere."

Implementation Gotchas

The gotchas we found deploying speculative decoding at scale:

Temperature matters. At temperature 0, acceptance rates are at their highest: both models decode greedily, and a draft token is accepted whenever it matches the target's top choice. At temperature 1.0, acceptance rates can drop below 50% because the target model's multinomial sampling diverges from the draft's predictions. We adjust speculation depth (number of draft tokens) based on the temperature in use.

KV cache management doubles in complexity. You're maintaining two KV caches, draft and target, and need to handle rollbacks correctly when drafts are rejected. Bugs here are subtle, and because the rollback path runs only on rejection, correctness issues may not surface in tests where rejections are rare.

Latency variance increases. Every cycle costs roughly the same wall-clock time, but a high-acceptance cycle emits several tokens while a low-acceptance cycle may emit only one, so the time per token becomes more variable. P99 latency may be worse than without speculative decoding even if P50 improves.
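
The KV cache rollback described above boils down to one invariant: after verification, neither cache may hold entries for positions past the accepted prefix. The sketch below uses a toy list-backed cache to show that invariant. `ToyKVCache` and `rollback_after_verify` are hypothetical names; a real cache stores key and value tensors per layer, and how many positions each cache holds before rollback depends on the serving stack.

```python
from typing import List


class ToyKVCache:
    """Stand-in for a per-sequence KV cache: one entry per cached token position."""

    def __init__(self) -> None:
        self.entries: List[int] = []

    def extend(self, tokens: List[int]) -> None:
        self.entries.extend(tokens)

    def truncate(self, length: int) -> None:
        # Drop every cached position at index >= length (no-op if already shorter).
        del self.entries[length:]

    def __len__(self) -> int:
        return len(self.entries)


def rollback_after_verify(draft_cache: ToyKVCache, target_cache: ToyKVCache,
                          prefix_len: int, num_accepted: int) -> int:
    """Trim both caches to the prefix plus the accepted drafts; return that length."""
    keep = prefix_len + num_accepted
    for cache in (draft_cache, target_cache):
        # A cache already shorter than `keep` is left alone; its missing
        # positions have to be computed on the next step.
        cache.truncate(keep)
    return keep


# Ten prompt tokens, four drafted tokens, two of which were accepted.
draft_cache, target_cache = ToyKVCache(), ToyKVCache()
prompt = list(range(10))
drafted = [101, 102, 103, 104]
draft_cache.extend(prompt + drafted)
target_cache.extend(prompt + drafted)
keep = rollback_after_verify(draft_cache, target_cache, prefix_len=10, num_accepted=2)
print(keep, len(draft_cache), len(target_cache))  # -> 12 12 12
```

Forgetting to trim one of the two caches is the kind of bug that stays invisible while every draft is accepted, which is why tests for this path should force rejections at every position.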

Results and Recommendations

In our production deployments, speculative decoding consistently delivers 2–3.5× faster per-request decoding for interactive single-user inference at temperature ≤ 0.7. The highest gains are on repetitive or templated outputs (code generation, structured extraction), where the draft model's predictions are accepted most often. Our recommendation: if you're deploying a model larger than 30B parameters for interactive use, implement speculative decoding. Use a same-family model at 1/20th to 1/50th the parameter count as your draft. Start with a draft length of 4–5 tokens. Measure acceptance rates and adjust draft length: lengthen it while most drafts are accepted, shorten it when drafted tokens are routinely thrown away. If you're running batch inference or high-temperature generation, spend your optimization budget elsewhere; speculative decoding won't help you there.
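
Measuring acceptance can start from a single counter: how many drafted tokens were accepted in each cycle. The sketch below assumes you log that count per cycle and that the draft length was fixed over the sample. It defines acceptance rate as accepted drafts over proposed drafts, which is one common definition rather than the only one. The function name and input format are hypothetical.

```python
from typing import Sequence, Tuple


def summarize_cycles(accepted_per_cycle: Sequence[int], draft_len: int) -> Tuple[float, float]:
    """Return (acceptance_rate, tokens_per_cycle) for a sample of logged cycles.

    accepted_per_cycle: number of accepted draft tokens in each cycle.
    draft_len:          number of tokens drafted per cycle (assumed fixed).
    """
    if draft_len <= 0:
        raise ValueError("draft_len must be positive")
    if not accepted_per_cycle:
        raise ValueError("need at least one cycle")
    if any(a < 0 or a > draft_len for a in accepted_per_cycle):
        raise ValueError("accepted counts must lie between 0 and draft_len")

    cycles = len(accepted_per_cycle)
    total_accepted = sum(accepted_per_cycle)
    acceptance_rate = total_accepted / (cycles * draft_len)
    # Each cycle also emits one token from the target model itself.
    tokens_per_cycle = total_accepted / cycles + 1.0
    return acceptance_rate, tokens_per_cycle


rate, tokens = summarize_cycles([4, 2, 0, 4], draft_len=4)
print(rate, tokens)  # -> 0.625 3.5
```

Tracking both numbers per temperature setting and per workload makes it clear where a longer draft would pay off and where it only adds wasted draft steps.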