SpecTr: Fast Speculative Decoding via Optimal Transport

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

TL;DR

SpecTr frames speculative decoding as a discrete optimal transport problem with a membership cost (OTM), which gives a principled way to analyze it and to extend it to $k$ drafts per token. The optimal transport plan can be computed by linear programming, but the best-known runtime is exponential in $k$, so the paper introduces a practical approximation (k-Seq) whose acceptance probability is $(1-1/e)$-optimal multiplicatively and which runs in time almost linear in the size of the single-token domain. The resulting autoregressive sampler, SpecTr, preserves the large model's output distribution and achieves a wall-clock speedup of 2.13X, a further 1.37X over single-draft speculative decoding on standard benchmarks.

Abstract

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks. One way to speed up sampling is $\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ (block or sequence of tokens), and then score all tokens in the draft by the large language model in parallel. A subset of the tokens in the draft are accepted (and the rest rejected) based on a statistical method to guarantee that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with $\textit{membership cost}$. This framework can be viewed as an extension of the well-known $\textit{maximal-coupling}$ problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
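
The accept/reject step described in the abstract can be illustrated for a single token and a single draft ($k=1$), where it reduces to the standard maximal-coupling rule. The sketch below is a minimal illustration, not the paper's implementation: it assumes `p` is the small (draft) model's next-token distribution and `q` is the large model's, both given as plain probability lists, and the function names are hypothetical.

```python
import random

def speculative_token(p, q, rng):
    """Single-draft token-level speculative sampling (maximal coupling).

    Assumption: p is the draft (small-model) distribution and q the target
    (large-model) distribution over the same vocabulary.
    Returns (token, accepted); the token is distributed according to q.
    """
    vocab = range(len(p))
    # Draft one token from the small model.
    x = rng.choices(vocab, weights=p)[0]
    # Accept the draft with probability min(1, q(x) / p(x)).
    if rng.random() < min(1.0, q[x] / p[x]):
        return x, True
    # On rejection, sample a correction from the residual distribution,
    # which is proportional to max(q - p, 0).
    residual = [max(qi - pi, 0.0) for pi, qi in zip(p, q)]
    y = rng.choices(vocab, weights=residual)[0]
    return y, False

def acceptance_probability(p, q):
    """Probability that the draft is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.25, 0.75]  # illustrative draft distribution
    q = [0.50, 0.50]  # illustrative target distribution
    samples = [speculative_token(p, q, rng) for _ in range(20000)]
    freq0 = sum(1 for tok, _ in samples if tok == 0) / len(samples)
    accept_rate = sum(1 for _, ok in samples if ok) / len(samples)
    print("empirical P(token=0):", freq0)
    print("empirical acceptance rate:", accept_rate)
    print("analytic acceptance probability:", acceptance_probability(p, q))
```

For these two illustrative distributions the analytic acceptance probability is $\min(0.25, 0.5) + \min(0.75, 0.5) = 0.75$, and the empirical token frequencies should track `q` rather than `p`. The multi-draft setting studied in the paper replaces this rule with a selection over $k$ candidates, which this sketch does not cover.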

Paper Structure (33 sections, 6 theorems, 57 equations, 6 figures, 4 tables, 4 algorithms)


Key Result

Lemma 1

(Appendix app:props) The optimal acceptance probability satisfies the following properties.

Figures (6)

  • Figure 1: One iteration of speculative decoding (Leviathan et al., 2022; Chen et al., 2023). Tokens in blue are decoded tokens from previous iterations, which are used as context for the current iteration. Tokens in red are drafts from the small model based on the context. The underlined tokens are the newly decoded tokens in the current iteration, where underlined red tokens represent tokens selected from the draft and the underlined green token is selected from the residual distribution.
  • Figure 2: One iteration of SpecTr. Tokens in blue are decoded tokens from previous iterations, which are used as context for the current iteration. Tokens in red are drafts from the small model based on the context. The underlined tokens are the newly decoded tokens in the current iteration, where underlined red tokens represent tokens selected from the draft and the underlined green token is selected from the residual distribution. See Figure 3 for a more detailed run of the draft selection step.
  • Figure 3: An example run of the sequence-level draft selection in SpecTr with $L=4$ and $4$ draft sequences. In the first step, there are $4$ draft tokens, and the token-level draft selection algorithm selects the word 'be', which appears three times. Note that all tokens following 'be' are valid draft tokens from the small model. In the second step, there are $3$ drafts and the selection algorithm selects 'liked'. The next token-level selection step has two drafts ('by' and 'for') and selects 'by'. Finally, there is only one draft following 'by'; the selection algorithm does not select it and outputs 'three' as a correction. The process ends and a total of $4$ tokens are generated.
  • Figure 4: Acceptance probability comparison between OTM-$k$ and $\textsc{k-Seq}$ when $p=\text{Ber}(0.25)$ and $q = \text{Ber}(b)$.
  • Figure 5: Optimal acceptance probability ($\alpha_k$) as a function of $k$ when $p=U(d)$ for $d=120$ and $q = U(d_q)$.
  • ...and 1 more figure
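
The sequence-level selection walked through in the Figure 3 caption can be sketched as a loop that repeatedly calls a token-level selector and keeps only the drafts that agree with its choice. This is a rough sketch under stated assumptions: `select_sequence` and the `select_token(prefix, candidates)` callback are hypothetical names, the callback stands in for the paper's token-level draft selection algorithm (which is not implemented here), and the draft sequences in the usage example are invented to match the counts in the caption.

```python
def select_sequence(drafts, select_token):
    """Sequence-level draft selection by prefix filtering.

    drafts: list of equal-length token sequences from the small model.
    select_token: hypothetical callback (prefix, candidates) -> token that
        stands in for a token-level draft selection rule. It may return one
        of the candidates or a correction token outside the candidates.
    Returns the list of newly decoded tokens.
    """
    output = []
    alive = list(drafts)
    for pos in range(len(drafts[0])):
        # Candidate next tokens offered by the drafts that are still alive.
        candidates = [d[pos] for d in alive]
        token = select_token(list(output), candidates)
        output.append(token)
        # Keep only the drafts consistent with the selected token.
        alive = [d for d in alive if d[pos] == token]
        if not alive:
            # The selector emitted a correction, so no draft can continue.
            break
    return output

if __name__ == "__main__":
    # Invented drafts consistent with the Figure 3 caption: 'be' appears
    # three times, two drafts continue with 'liked', one continues with 'by'.
    drafts = [
        ["be", "liked", "by", "all"],
        ["be", "liked", "for", "its"],
        ["be", "seen", "as", "a"],
        ["have", "been", "a", "hit"],
    ]
    # Scripted selector that replays the choices described in the caption.
    choices = iter(["be", "liked", "by", "three"])
    result = select_sequence(drafts, lambda prefix, cands: next(choices))
    print(result)
```

With the scripted choices, the candidate sets shrink from four drafts to three, two, and one, and the final step emits the correction token, mirroring the caption's run of four generated tokens.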

Theorems & Definitions (10)

  • Definition 1: Coupling
  • Claim 1
  • Definition 2: Optimal Transport (OT) (Villani, 2009)
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • proof