SSSD: Simply-Scalable Speculative Decoding

Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

TL;DR

SSSD tackles the deployment bottleneck of speculative decoding by removing the need for training or pre-built draft models. It couples a CPU-based n-gram matcher over the prompt and self-output with a continuously updatable datastore (suffix-array backed) to propose draft tokens, while aligning with hardware via a roofline-informed speculation budget. Across multilingual, long-context, and batching scenarios, SSSD achieves up to $2.9\times$ latency reduction and competitive end-to-end speedups compared to training-based approaches, with minimal adoption effort. This training-free, data-driven approach broadens practical deployment of speculative decoding, reducing maintenance overhead and improving robustness to distribution shifts in real-world serving workloads.

Abstract

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort (no data preparation, training, or tuning is needed) and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
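
To make the n-gram matching idea concrete, the following is a minimal sketch of proposing draft tokens from the prompt and self-output alone. It is an illustration under stated assumptions, not the paper's implementation: the function name `propose_draft` and its parameters are hypothetical, and a plain linear scan stands in for the suffix-array-backed datastore described above.

```python
def propose_draft(tokens, max_ngram=4, min_ngram=1, draft_len=5):
    """Propose draft tokens by matching the current suffix against earlier context.

    `tokens` is the prompt plus everything generated so far (token ids).
    Hypothetical helper: tries the longest suffix first and returns the
    tokens that followed its most recent earlier occurrence.
    """
    n_tokens = len(tokens)
    # Prefer longer n-grams: a longer match is stronger evidence.
    for n in range(min(max_ngram, n_tokens - 1), min_ngram - 1, -1):
        suffix = tokens[n_tokens - n:]
        # Scan right to left so the most recent occurrence wins;
        # the starting index excludes the suffix matching itself.
        for start in range(n_tokens - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + draft_len]
    return []  # no match: fall back to ordinary autoregressive decoding


context = [1, 2, 3, 4, 5, 1, 2, 3]
draft = propose_draft(context, draft_len=2)
```

The target model would then verify the proposed tokens in a single forward pass and keep the longest accepted prefix; that verification step is outside this sketch.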

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 2 tables, 3 algorithms.

Figures (6)

  • Figure 1: A representation of the system with the main steps of the SSSD method.
  • Figure 2: (a,b) Roofline-based forward-pass time vs. speculation length, with contributions from linear operations and FlashAttention for Llama-3.1-8B at batch size 8 (same hardware as in Figure 5a). Curves correspond to single operations from Table \ref{tab:formulas}. (c) Accepted tokens, normalized cost, and theoretical speedup vs. speculation length.
  • Figure 3: (a) Speculation quality of SSSD data sources on 160 MT-Bench and GSM8K prompts. (b,c) Comparison with parameter-free baselines. REST and SSSD use the same datastore; the Lookahead cache is evaluated both warm (identical data as SSSD) and cold. Solid lines show accepted tokens, dashed lines show candidate retrieval and mask construction time. All experiments use Llama-3.1-8B.
  • Figure 4: Experiments on Qwen3-14B on a 48 GB GPU (165 TFLOPS bfloat16, 950 GB/s measured), evaluated on MT-Bench (and translations). Temperature = 0.7, top-p = 0.8, top-k = 20; averages over five runs. (a, b) Speedup over autoregressive decoding at batch size 1 versus datastore size ($\approx$1,000 tokens per user conversation on average). (c) Corresponding acceptance length, comparing model-generated and dataset-derived entries.
  • Figure 5: Evaluation of speculation methods on 8B models. (a–d) Llama-3.1-8B on a single GPU (same setup as Figure 4) across multiple datasets. (e) Llama-3.1-8B in disaggregated prefill–decode mode on 4 GPUs (1 for decoding; 80 GB VRAM, 710 TFLOPS measured, 3.1 TB/s measured bandwidth). (f) DeepSeek-R1-Distill-Llama-8B on MATH-500 using the same hardware as in (a), averaged over five runs.
  • ...and 1 more figure
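
The trade-off summarized in the Figure 2 caption (accepted tokens vs. normalized cost vs. theoretical speedup) can be sketched with a simplified roofline model. Everything below is an assumption made for illustration: only the linear layers are modeled (attention and KV-cache traffic are ignored), acceptance is modeled as an independent per-token probability `accept_prob`, and the function names are hypothetical. The hardware defaults reuse the 165 TFLOPS and 950 GB/s figures from the Figure 4 caption; the paper's own per-operation formulas are not reproduced here.

```python
def forward_time(spec_len, batch=8, params=8e9, bytes_per_param=2,
                 peak_flops=165e12, bandwidth=950e9):
    """Roofline estimate (seconds) of one forward pass over the linear layers."""
    tokens = batch * (spec_len + 1)                  # verified tokens per step
    compute_s = 2 * params * tokens / peak_flops     # ~2 FLOPs per weight per token
    memory_s = params * bytes_per_param / bandwidth  # weights are read once per step
    return max(compute_s, memory_s)


def expected_tokens(spec_len, accept_prob=0.6):
    """Expected tokens emitted per step for a single draft chain (assumed model)."""
    return sum(accept_prob ** i for i in range(spec_len + 1))


def best_speculation_length(max_len=32, accept_prob=0.6, **hardware):
    """Pick the speculation length that maximizes theoretical speedup."""
    base = forward_time(0, **hardware)

    def speedup(spec_len):
        normalized_cost = forward_time(spec_len, **hardware) / base
        return expected_tokens(spec_len, accept_prob) / normalized_cost

    return max(range(max_len + 1), key=speedup)


budget = best_speculation_length()
```

In this simplified model, extra speculative tokens cost nothing while the pass stays memory-bound, so the chosen budget sits at the last speculation length before the pass becomes compute-bound. Larger batches reach that ridge point sooner and therefore leave room for fewer draft tokens.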