SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Ryan Sun, Tianyi Zhou, Xun Chen, Lichao Sun

TL;DR

SpecHub is a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. It does so by simplifying the OTM problem into a compact Linear Programming model, which significantly reduces computational complexity.

Abstract

Large Language Models (LLMs) have become essential in advancing natural language processing (NLP) tasks, but their sequential token generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model to generate multiple token sequences, which the target LLM verifies in parallel. However, current heuristic approaches, such as Recursive Rejection Sampling (RRS), suffer from low acceptance rates in subsequent drafts, limiting the advantages of using multiple drafts. Meanwhile, Optimal Transport with Membership Cost (OTM) can theoretically improve acceptance rates, but its computational cost is too high for real-time use. We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. By simplifying the OTM problem into a compact Linear Programming model, SpecHub significantly reduces computational complexity. It further accelerates sampling by leveraging a sparse joint distribution, focusing computation on high-probability token sequences. In extensive experiments, SpecHub consistently generates 0.05-0.27 and 0.02-0.16 more tokens per step than RRS and RRS without replacement, respectively. Our code is available at https://github.com/MasterGodzilla/Speculative_decoding_OT.

Paper Structure

This paper contains 43 sections, 5 theorems, 28 equations, 9 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1

For a given joint draft distribution $Q$ and target distribution $p$, the optimal solution of the simplified LP formulation achieves the same transport cost as the maximal coupling in the Optimal Transport with Membership Cost (OTM) problem, i.e., $1- \sum_{x_{1:k}\in\mathcal{V}^k}\sum_{i=1}^k \pi(

Figures (9)

  • Figure 1: Batch efficiency of SpecHub, RRS, and RRSw with different numbers of nodes in a binary token tree with temperature $T=1.0$.
  • Figure 2: An example of a token tree of depth $d=4$ for MDSD. The tree is generated sequentially with the draft model and evaluated concurrently with the target model. Each path in the tree corresponds to a potential sequence of tokens, with accepted tokens and rejected tokens highlighted. The black arrows indicate tokens that were not visited. The dashed line represents a sample drawn from the residual distribution after all drafts are rejected. Our paper focuses on the evaluation of a single step: how we sample the $k=2$ tokens " dinner" and " to" from the draft distribution $q(\cdot|\text{"I want"})$ and decide which of them is accepted based on the target probabilities $p(\text{" dinner"}|\text{"I want"})$ and $p(\text{" to"}|\text{"I want"})$.
  • Figure 3: An illustration of rejection sampling. Sampling from the draft distribution gives a point under the blue distribution $q$. If the sample also lies under the overlap with the target distribution $p$, we accept it. If not, we reject the token and sample from the residual distribution, i.e. the remaining unexplored area $\max(p-q, 0)$, normalized. The misalignment between the residual distribution and the draft distribution makes Recursive Rejection Sampling (RRS) inefficient in subsequent rounds.
  • Figure 4: An illustration comparing the Optimal Transport with Membership Cost (OTM) framework and SpecHub. In both (a) and (b), the left side shows a two-draft joint sampling distribution, while the right side depicts the target distribution. The yellow bars highlight the token of interest in the target. In (a), OTM requires solving for the transport map $\pi$ of a dense sampling distribution like $Q = q^{\otimes 2}$, which is computationally expensive. In (b), SpecHub simplifies this process by sparsifying the joint distribution, significantly reducing the complexity of solving for $\pi$.
  • Figure 5: A comparison of an optimal solution to an RRSw solution under the LP formulation. Here, the draft distribution $q = [0.5,0.3,0.2]$ and the target distribution $p = [0.1, 0.6, 0.3]$. Each number on the top of the cell is $Q(x_1, x_2)$, and the numbers at the bottom of the cell show $\pi(x_1,x_2, x_1)$ and $\pi(x_1,x_2, x_2)$, i.e. how much of those draft probabilities are transferred to the target probability. RRSw has a transport cost of $0.06$ for not generating enough token 'b'.
  • ...and 4 more figures
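
The rejection-sampling step in Figure 3 and the RRSw cost quoted in Figure 5 can be checked by hand on the three-token example. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes RRSw accepts the first draft $x_1 \sim q$ with probability $\min(1, p(x_1)/q(x_1))$ and, on rejection, draws the second draft from $q$ with $x_1$ removed and renormalized, testing it against the normalized residual $\max(p-q, 0)$. The function names `residual`, `accept_prob`, and `rrsw_reject_prob` are hypothetical.

```python
def residual(p, q):
    """Normalized max(p - q, 0): the target used after a draft is rejected."""
    gap = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(gap)
    return [g / total for g in gap] if total > 0 else gap


def accept_prob(p, q):
    """Chance that x ~ q is accepted when accepting with prob min(1, p(x)/q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def rrsw_reject_prob(p, q):
    """Probability that both drafts are rejected (assumed two-draft RRSw scheme)."""
    p_res = residual(p, q)
    total = 0.0
    for x1, q1 in enumerate(q):
        if q1 == 0.0:
            continue
        # Joint probability of drawing x1 first and rejecting it.
        reject_first = q1 * (1.0 - min(1.0, p[x1] / q1))
        if reject_first == 0.0:
            continue
        remaining = 1.0 - q1
        if remaining <= 0.0:
            # No other token is left to draft, so the rejection stands.
            total += reject_first
            continue
        # Second draft: q without x1, renormalized (sampling without replacement).
        q_second = [0.0 if i == x1 else qi / remaining for i, qi in enumerate(q)]
        total += reject_first * (1.0 - accept_prob(p_res, q_second))
    return total


# Distributions from the Figure 5 caption (tokens a, b, c).
q = [0.5, 0.3, 0.2]
p = [0.1, 0.6, 0.3]

first_accept = accept_prob(p, q)
both_rejected = rrsw_reject_prob(p, q)
print(f"first-draft acceptance: {first_accept:.2f}")
print(f"RRSw rejection probability: {both_rejected:.2f}")
```

Under these assumptions only token 'a' can be rejected in the first round (with probability $0.5 \times 0.8 = 0.4$), and the second draft is then rejected with probability $0.15$, giving $0.4 \times 0.15 = 0.06$, which agrees with the transport cost stated in the Figure 5 caption.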

Theorems & Definitions (10)

  • Theorem 1: Equivalence of LP to OTM
  • proof
  • proof
  • Theorem 2
  • proof
  • Corollary 1: Top Token Acceptance
  • Theorem 3: Superiority over RRS
  • proof
  • Theorem 4: Superiority over OTM
  • proof