Speculative Speculative Decoding

Tanishq Kumar, Tri Dao, Avner May

TL;DR

This work introduces speculative speculative decoding (SSD), which runs speculation and verification in parallel rather than sequentially. It identifies three key challenges that SSD presents and proposes a principled method to address each.

Abstract

Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
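
The cache lookup described in the abstract can be sketched as follows. This is a minimal illustration, not the Saguaro implementation: the function names, the `(accepted_length, bonus_token)` cache key, the toy draft model, and the `fallback` callable used on a miss are all assumptions made for this example. In a real system the cache would be built on a separate device while verification runs; here everything runs sequentially.

```python
from typing import Callable, Dict, List, Tuple

# A verification outcome: (number of accepted draft tokens, bonus token id).
Outcome = Tuple[int, int]


def build_speculation_cache(
    prefix: List[int],
    draft_tokens: List[int],
    guess_bonus: Callable[[List[int], int], List[int]],
    draft_next: Callable[[List[int]], int],
    fan_out: List[int],
    spec_len: int,
) -> Dict[Outcome, List[int]]:
    """Pre-compute one speculation per guessed verification outcome.

    In SSD this work would overlap with the target model's verification
    pass; here it simply runs as ordinary sequential Python.
    """
    cache: Dict[Outcome, List[int]] = {}
    for k in range(len(draft_tokens) + 1):  # k = accepted prefix length
        accepted = prefix + draft_tokens[:k]
        for bonus in guess_bonus(accepted, fan_out[k]):
            context = accepted + [bonus]
            spec: List[int] = []
            for _ in range(spec_len):
                spec.append(draft_next(context + spec))
            cache[(k, bonus)] = spec
    return cache


def next_speculation(
    cache: Dict[Outcome, List[int]],
    outcome: Outcome,
    fallback: Callable[[], List[int]],
) -> Tuple[List[int], bool]:
    """Return (speculation, cache_hit) for the actual verification outcome."""
    if outcome in cache:
        return cache[outcome], True  # hit: speculation is already prepared
    return fallback(), False  # miss: fall back to drafting on demand


# Toy stand-ins for a draft model (hypothetical, for illustration only).
def toy_draft_next(context: List[int]) -> int:
    return (context[-1] + 1) % 100


def toy_guess_bonus(context: List[int], n: int) -> List[int]:
    return [(context[-1] + 1 + i) % 100 for i in range(n)]


cache = build_speculation_cache(
    prefix=[1, 2],
    draft_tokens=[3, 4, 5],
    guess_bonus=toy_guess_bonus,
    draft_next=toy_draft_next,
    fan_out=[2, 2, 2, 2],
    spec_len=3,
)
hit_spec, was_hit = next_speculation(cache, (2, 6), fallback=lambda: [])
```

With these toy functions, the outcome "two tokens accepted, bonus token 6" is one of the eight cached outcomes, so the next speculation is returned without any further drafting. An outcome outside the cache goes to the fallback path, which pays the drafting latency as in ordinary speculative decoding.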

Paper Structure

This paper contains 50 sections, 10 theorems, 54 equations, 13 figures, 1 table, and 1 algorithm.

Key Result

Theorem 1 (Leviathan et al., 2023)

Figures (13)

  • Figure 1: (Left) Ordinary speculative decoding (SD) requires the verifier to wait idly for the draft to speculate. (Center) In our algorithm, speculation runs on a separate device (1$\times$H100) in parallel with verification; the draft precomputes speculations for many possible verification outcomes and returns the speculated tokens immediately if one occurs. (Right) End-to-end performance of SSD, SD, and autoregressive (AR) decoding averaged over four datasets spanning math, code and chat, for Llama-3.1-70B on TP=4 H100s, batch size 1, greedy decoding, Llama-3.2-1B draft model.
  • Figure 2: Schematic of the speculation cache strategy. We allocate fan-out $F_k$ (bonus token guesses) over sequence length $K+1$ according to Theorem \ref{thm:cache_topology}.
  • Figure 3: Scaling of cache hit rates with fan-out. (Left, middle) The rejection rates ($1 - p_{\text{hit},*}(F)$) conditioned on the prior speculation coming from the primary vs. backup speculator, respectively. Rejection rates (cache misses) fall as a power law in the draft fan-out, demonstrating that cache hit rates increase with cache size. (Right) The overall cache hit rate $p_\text{hit}(F)$.
  • Figure 4: The advantage of the geometric fan-out strategy increases at higher temperatures, improving the speculation cache hit rate (right) and thus end-to-end speed (left). Results are averaged over four datasets. At all temperatures, SSD with either fan-out strategy outperforms ordinary speculative decoding.
  • Figure 5: We introduce Saguaro sampling, a novel sampling scheme designed specifically for SSD. (Left) It interpolates between high cache hit rate and high speculative acceptance rate. (Right) Illustrative schematic for how Saguaro sampling increases residual probability mass on the top draft tokens, encouraging the sampled bonus token to lie in the speculation cache by construction.
  • ...and 8 more figures
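
The Figure 3 caption states that cache misses fall as a power law in the fan-out. A minimal sketch of what that relationship implies, assuming the form $\text{miss}(F) = c \cdot F^{-\alpha}$: the constants `c` and `alpha` below are hypothetical placeholders chosen for illustration, not values fitted in the paper.

```python
def miss_rate(fan_out: int, c: float, alpha: float) -> float:
    """Assumed power-law cache-miss model: c * fan_out**(-alpha), capped at 1."""
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    return min(1.0, c * fan_out ** (-alpha))


def hit_rate(fan_out: int, c: float, alpha: float) -> float:
    """Cache hit rate implied by the miss model above."""
    return 1.0 - miss_rate(fan_out, c, alpha)


# Hypothetical constants, chosen only to show the trend.
rates = [hit_rate(f, c=0.5, alpha=1.0) for f in (1, 2, 4, 8)]
```

Under this assumed model, each doubling of the fan-out multiplies the miss rate by a constant factor, so the hit rate keeps rising with cache size but with diminishing returns.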

Theorems & Definitions (21)

  • Theorem 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Theorem 7
  • Corollary 8
  • Corollary 9
  • Definition 10
  • ...and 11 more