Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization

Rahul Krishna Thomas, Arka Pal

TL;DR

This work gives the first multi-draft speculative sampling algorithm that reaches 90% acceptance with under 100 ms of overhead per generated token and negligible deviation from the target model distribution. It does so by reducing the exponentially large OTLP to a convex optimization problem in at most $V$ variables, where $V$ is the vocabulary size.

Abstract

Speculative sampling reduces the latency of autoregressive decoding for target LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where $n$ draft tokens are generated at each step and the verification criterion is a distribution conditioned on them. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, where $V$ is the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, which therefore remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
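For orientation, the single-draft setting that the abstract starts from can be sketched in a few lines. This is a minimal illustration of the standard accept-or-resample rule (accept the draft token $x$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(p-q, 0)$), not the paper's multi-draft algorithm. The function names and the toy distributions `p` (target) and `q` (draft) are assumptions made for the example.

```python
import numpy as np

def speculative_step(p, q, rng):
    """One single-draft verification step (illustrative sketch).

    p: target distribution, q: draft distribution (1-D arrays summing to 1).
    Returns (token, accepted).
    """
    x = rng.choice(len(q), p=q)                 # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x, True
    residual = np.maximum(p - q, 0.0)           # mass the draft under-covers
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

def single_draft_acceptance(p, q):
    """Acceptance probability of the rule above: sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

# Toy distributions over a 3-token vocabulary (hypothetical values).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(single_draft_acceptance(p, q))  # approximately 0.7
```

The returned token is distributed exactly according to `p`, which is the sense in which inference quality is preserved. The multi-draft extension replaces the single proposal `x` with $n$ proposals and asks for the verification rule that maximizes the chance of accepting any of them.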

Paper Structure

This paper contains 39 sections, 16 theorems, 104 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

The optimal acceptance rate $\alpha^*$ can be computed as

Figures (5)

  • Figure 1: Optimal acceptance rates from $n$ i.i.d. drafts with top-$k$ sampling, for the target/draft pairs Gemma-2 27B/2B and Llama-3 70B/8B. Increasing $k$ improves the acceptance rate significantly up to $k=1000$, and increasing $n$ also yields a steady increase in optimal acceptance.
  • Figure 2: Comparison of i.i.d. versus greedy acceptance rates for Gemma-2 27B/2B across various choices of $n$ and top-$k$ sampling of the draft.
  • Figure 3: Comparison of i.i.d. versus greedy acceptance rates for Llama-3 70B/8B across various choices of $n$ and top-$k$ sampling of the draft.
  • Figure 4: Optimal acceptance rates from $n$ i.i.d. drafts with top-$k$ sampling for the target/draft pair Gemma-2 27B/2B, at various target temperature settings ($0.2, 0.4, 0.6, 0.8$). Until temperature $0.8$, increasing $k$ past $10$ yields little acceptance gain for reasonable values of $n$.
  • Figure 5: Solve times and failure rates of global resolution with $\tau=10^{-4}$ (Gemma-2 27B/2B) for choices of $k\in \{10,100,1000,10000\}$ and $n \in \{2,3,4,5\}$, plotted over increasing temperature.
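
The figures above vary two knobs of the draft side: the top-$k$ truncation and the number $n$ of i.i.d. drafts. The sketch below shows one plausible way to set up those knobs, assuming top-$k$ means keeping the $k$ most likely draft tokens and renormalizing. The helper names are hypothetical, and the acceptance function covers only the $n=1$ baseline $\sum_x \min(p(x), q_k(x))$; it does not compute the optimal $n$-draft acceptance rate studied in the paper.

```python
import numpy as np

def top_k_renormalize(q, k):
    """Keep the k largest entries of q and renormalize (ties broken arbitrarily)."""
    q = np.asarray(q, dtype=float)
    if k < 1:
        raise ValueError("k must be at least 1")
    if k >= q.size:
        return q / q.sum()
    keep = np.argpartition(q, -k)[-k:]   # indices of the k largest entries
    out = np.zeros_like(q)
    out[keep] = q[keep]
    return out / out.sum()

def draw_iid_drafts(q, k, n, rng):
    """Draw n i.i.d. draft tokens from the top-k truncated draft distribution."""
    q_k = top_k_renormalize(q, k)
    return rng.choice(q_k.size, size=n, p=q_k)

def n1_acceptance(p, q_k):
    """Single-draft (n = 1) acceptance baseline: sum_x min(p(x), q_k(x))."""
    return float(np.minimum(p, q_k).sum())

# Toy 4-token example (hypothetical values): uniform target, skewed draft.
p_toy = np.full(4, 0.25)
q_toy = np.array([0.4, 0.3, 0.2, 0.1])
for k in (1, 2, 4):
    print(k, n1_acceptance(p_toy, top_k_renormalize(q_toy, k)))
```

In this toy case the $n=1$ baseline rises from about 0.25 at $k=1$ to about 0.8 at $k=4$, because a harsher truncation removes draft mass from tokens the target still wants. This is only an illustration on made-up numbers and says nothing about the magnitudes reported in the figures.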

Theorems & Definitions (30)

  • Theorem 3.1: hu2025towards
  • Theorem 4.1
  • Theorem 5.1
  • Theorem 6.1: Outer Residuals
  • Theorem 6.2
  • Lemma 1
  • Theorem 6.3: Outer Convex Solver
  • Theorem 6.4: Inner Convex Solver
  • Lemma 2: Approximation Guarantee
  • Theorem B.1
  • ...and 20 more