Speculative Sampling via Exponential Races

Szymon Kobus, Deniz Gündüz

TL;DR

The paper addresses accelerating autoregressive text generation by linking speculative decoding to channel simulation and introducing Exponential Race Speculative Decoding (ERSD). It shows that the speed-up from drafting is governed by the entropy of the acceptance distribution $H[R]$, with a lower bound via $D_{KL}(P||Q)$ and an upper bound arising from channel-simulation schemes, yielding the bound $\mathbb{E}[\#generated] \le (\log|\Omega| + \log(k+1)) / H[R]$. It develops an optimal drafting-tree algorithm with $O(k\log k)$ complexity and connects the drafting process to Tunstall coding, providing an asymptotic interpretation of speed-ups in terms of source compression. Empirically, ERSD matches state-of-the-art speculative decoding performance and reveals that higher-order dependencies can influence practical outcomes, highlighting both theoretical insight and avenues for further improvement.

Abstract

Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims to simulate a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens $k$ generated by the draft model for large $k$, which serves as an upper bound for all $k$. We also propose a novel speculative decoding method based on exponential races, ERSD, that matches state-of-the-art performance.

Paper Structure

This paper contains 10 sections, 4 theorems, 15 equations, 5 figures, 5 algorithms.

Key Result

Lemma 3.1

For with target distribution $P=P(\,\cdot\mid x_{:n})$ and draft distribution $Q=Q(\,\cdot\mid x_{:n})$, the probability of accepting the first drafted token, $P_{accept}^{(1)}$, satisfies: where $D_{HM}$ denotes the harmonic mean distance, defined as:
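
The first-token acceptance probability can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation. It assumes the lemma concerns the exponential-race coupling described in the Figure 2 caption, where token $i$ arrives at time $E_i/Q_i$ under the draft model and $E_i/P_i$ under the target model, with shared $E_i \sim \mathrm{Exp}(1)$. The function names are hypothetical, and the closed-form expression in `exact_first_token_acceptance` is derived here from standard properties of exponential random variables rather than copied from the paper.

```python
import random


def race_winner(probs, noise):
    """Index of the first arrival when token i arrives at time noise[i] / probs[i]."""
    best, best_time = None, float("inf")
    for i, (p, e) in enumerate(zip(probs, noise)):
        if p > 0 and e / p < best_time:
            best, best_time = i, e / p
    return best


def simulate_first_token_acceptance(p, q, trials=100_000, seed=0):
    """Monte Carlo estimate of the chance that both races share a winner.

    The same exponential noise is used for the draft race (q) and the
    target race (p); the drafted token is accepted when the winners match.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        noise = [rng.expovariate(1.0) for _ in p]
        hits += race_winner(q, noise) == race_winner(p, noise)
    return hits / trials


def exact_first_token_acceptance(p, q):
    """Closed form for the same probability (derived for this sketch).

    Token i wins both races iff E_j > E_i * max(p_j/p_i, q_j/q_i) for all
    j != i; integrating over E_i gives 1 / sum_j max(p_j/p_i, q_j/q_i).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi > 0:
            total += 1.0 / sum(max(pj / pi, qj / qi) for pj, qj in zip(p, q))
    return total


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]  # illustrative P, not from the paper
    draft = [0.4, 0.4, 0.2]   # illustrative Q, not from the paper
    print(simulate_first_token_acceptance(target, draft))
    print(exact_first_token_acceptance(target, draft))
```

The two printed values should agree up to Monte Carlo error. When the draft and target distributions coincide, both races always share a winner, so the acceptance probability is 1.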

Figures (5)

  • Figure 1: Speculative decoding trees: a black draft tree overlaid with a green/red GSD decision tree. Black vertices and arrows represent the draft tree; each vertex is a drafted token, and paths from the gray root are potential text continuations. Green/red arrows show GSD acceptance/rejection decisions. Blue leaf vertices signify sampling from a distribution. (Note: ERSD does not follow the same decision tree.)
  • Figure 2: Illustration of exponential races for speculative decoding. Each bar represents a potential next token, with height corresponding to arrival time. The first arrival under the draft model distribution $Q$ (left) predicts the first arrival under the target model distribution $P$ (right).
  • Figure 3: Expected number of accepted tokens as a function of the number of drafted tokens for sequence , batch , $\tau^*$tree (optimal), and SpecInfer tree drafting strategies, for and . Results shown for draft model $Q$ Llama-3.2-1B, and target model $P$ Llama-3.1-70B-Instruct.
  • Figure 4: Marginal probability of acceptance as a function of the number of drafted tokens for sequence , batch , $\tau^*$tree (optimal) drafting strategies, for and . Results shown for draft model $Q$ Llama-3.2-1B, and target model $P$ Llama-3.1-70B-Instruct.
  • Figure : Simple ($k=1$)

Theorems & Definitions (4)

  • Lemma 3.1
  • Theorem 4.1
  • Lemma 4.2
  • Lemma B.1