
Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

Amirhossein Bozorgkhoo, Igor Molybog

Abstract

Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.

Paper Structure

This paper contains 39 sections, 25 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Visualization of the fitted affine plane relating the estimated scaling parameter $\alpha$ to draft model perplexity ($x$) and target model perplexity ($y$). Scatter points represent empirical $(x, y, \alpha)$ observations from Tables \ref{Table:Alpha_perplexity:OPT_drafts} and \ref{Table:Alpha_perplexity:Qwen_drafts}, while the surface corresponds to the least-squares fit of Equation \ref{eq:plane}.
  • Figure 2: Throughput (tokens per FLOP) predicted by Equation \ref{eq:throughput_N} as a function of draft model size $N$ (in billions of parameters) for different target models and draft model families, based on Table \ref{table:Optimal_draft_model_size_throughput}. Each curve corresponds to a single target model and is annotated directly on the curve. Black markers indicate the draft model sizes used in experiments, while star markers denote the optimal draft size $N^\ast$ that maximizes predicted throughput for each target model.
  • Figure 3: Curves show estimated $\alpha$ values as a function of draft model perplexity, based on data from Tables \ref{Table:Alpha_perplexity:OPT_drafts} and \ref{Table:Alpha_perplexity:Qwen_drafts}. The $\alpha$ values were computed with the method outlined in Appendix \ref{sec:alpha_CI}, which estimates them from accepted-token statistics. Function details appear in Table \ref{table:alpha_regression_draft}.
  • Figure 4: Latency metrics (TTFT, TTOT, and TPOT) for the OPT-13B target model plotted as a function of the normalized deviation $|N - N^\ast| / M$ from the analytically predicted optimal draft size. Each point corresponds to an individual draft model from the OPT, Qwen1.5, or Qwen2.5 families, with error bars indicating 95% confidence intervals across prompts. The vertical dashed line marks the predicted optimum $N^\ast$. Across all metrics and draft families, latency increases with distance from $N^\ast$, supporting the accuracy of the throughput-based optimal draft size prediction.
  • Figure 5: Numerical characterization of the throughput-optimal draft model size $N^\ast$ as a function of the target model size $M$, the draft training dataset size $D$, and the target training dataset size $D^\prime$. Each row presents the raw dependence of $N^\ast$ (left) alongside the corresponding normalized view $N^\ast/M$ (right). The results show that the dominant scaling of the optimal draft size is approximately linear in the target model size, while the effects of the training dataset sizes act as weaker, second-order corrections.
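
The least-squares plane fit mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the affine form $\alpha \approx a\,x + b\,y + c$, where $x$ is draft perplexity and $y$ is target perplexity; the coefficient names `a`, `b`, `c`, the helper `fit_affine_plane`, and all data values below are hypothetical placeholders, not numbers from the paper's tables.

```python
import numpy as np


def fit_affine_plane(x, y, alpha):
    """Least-squares fit of alpha ~ a*x + b*y + c (assumed affine form).

    x: draft model perplexities, y: target model perplexities,
    alpha: estimated scaling parameters. Returns the array (a, b, c).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Design matrix with columns [x, y, 1]; the last column is the intercept.
    design = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(design, alpha, rcond=None)
    return coeffs


# Synthetic, noise-free placeholder observations (NOT the paper's data).
rng = np.random.default_rng(0)
x = rng.uniform(10.0, 30.0, size=20)   # hypothetical draft perplexities
y = rng.uniform(8.0, 15.0, size=20)    # hypothetical target perplexities
true_coeffs = np.array([-0.02, 0.03, 0.9])
alpha = true_coeffs[0] * x + true_coeffs[1] * y + true_coeffs[2]

a, b, c = fit_affine_plane(x, y, alpha)
```

With real $(x, y, \alpha)$ observations in place of the synthetic arrays, the returned coefficients would define the surface drawn over the scatter points; the residuals from `np.linalg.lstsq` could be inspected to judge how well an affine plane describes the data.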