Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

Jonathan Mamou; Oren Pereg; Daniel Korat; Moshe Berchansky; Nadav Timor; Moshe Wasserblat; Roy Schwartz

TL;DR

This work shows that the common practice of using the same speculation lookahead (SL) for all iterations (static SL) is suboptimal, and introduces DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL.

Abstract

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
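
To make the role of the SL concrete, the following is a minimal sketch of a single greedy speculative decoding iteration. It is an illustration under stated assumptions, not the paper's implementation: `speculative_iteration`, `toy_draft`, and `toy_target` are hypothetical names, the two "models" are deterministic toy functions over integer tokens, and verification is written as a loop even though a real target model would check all drafted tokens in one parallel forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_iteration(
    prefix: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    sl: int,
) -> Tuple[List[Token], int]:
    """Run one greedy speculative iteration with speculation lookahead `sl`.

    Returns the tokens appended in this iteration and how many drafted
    tokens were accepted.
    """
    # The draft model autoregressively proposes `sl` tokens.
    drafted: List[Token] = []
    for _ in range(sl):
        drafted.append(draft_next(prefix + drafted))

    # The target model validates the drafted tokens in order.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            return accepted + [expected], len(accepted)
        accepted.append(tok)

    # Every drafted token was accepted; the target adds one more token.
    return accepted + [target_next(prefix + accepted)], len(accepted)


# Hypothetical toy models: the draft disagrees with the target whenever
# the context length is a multiple of 4.
def toy_target(ctx: List[Token]) -> Token:
    return (len(ctx) * 3) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return toy_target(ctx) if len(ctx) % 4 != 0 else -1
```

In this toy setting, calling `speculative_iteration([0], toy_draft, toy_target, 5)` accepts three of the five drafted tokens, while the same call with `sl=3` yields the same appended tokens with two fewer draft steps. That gap between the SL used and the number of tokens actually accepted is the waste that a per-iteration choice of SL aims to reduce.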

Paper Structure

This paper contains 25 sections, 2 equations, 9 figures, 6 tables.

Introduction
Background: Speculative Decoding
Speculation lookahead (SL)
Dynamic Speculation Lookahead
Finding the optimal SL per iteration
DynamIc SpeCulation lookahead Optimization
Experiments
Datasets and Models
SL classifier training
Baselines and Results
Related Work
Conclusion
Oracle SL Analysis
Datasets and Prompts Details
MBPP
...and 10 more sections

Figures (9)

Figure 1: An illustration of a single speculative decoding iteration with Speculation Lookahead (SL) = 5. Given a prompt $t_0$, a draft model autoregressively generates 5 tokens $t_1, \ldots, t_5$. The target model validates them all in parallel and accepts only $t_1, t_2, t_3$. As $t_4$ and $t_5$ are rejected, the SL is suboptimal (too large).
Figure 2: Oracle and static SL values for different speculative iterations on one MBPP example. For static SL, we run 38 target forward passes and 192 draft forward passes, while for oracle SL, we only run 27 target forward passes and 129 draft forward passes. We observe a high variance of oracle SL values.
Figure 3: The average oracle SL over the normalized index of the speculative iterations for the Alpaca dataset. We observe a high variance of oracle SL values.
Figure 4: Oracle SL probability histogram on the different datasets. We observe a high variance of SL values.
Figure 5: Bar chart of the average oracle SL. Note that the iteration index seems to have low predictive power for the oracle SL.
...and 4 more figures
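
The forward-pass counts quoted in the Figure 2 caption can be turned into relative savings with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical, and the numbers describe only the single MBPP example in that caption; fewer forward passes do not translate one-to-one into wall-clock speedup.

```python
def relative_reduction(baseline: int, improved: int) -> float:
    """Fraction of forward passes saved relative to the baseline count."""
    return (baseline - improved) / baseline


# Counts taken from the Figure 2 caption (static SL vs. oracle SL).
target_saving = relative_reduction(38, 27)
draft_saving = relative_reduction(192, 129)

print(f"target forward passes saved: {target_saving:.1%}")  # 28.9%
print(f"draft forward passes saved: {draft_saving:.1%}")    # 32.8%
```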
