Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun

TL;DR

The paper tackles the slow inference of autoregressive LLMs by marrying speculative decoding with beam sampling through dynamic-width adjustments. It introduces DSBD, a framework that constructs a draft forest from a small model, then verifies and refines multiple beams in parallel using a large model, with an adaptive width per layer to balance efficiency and quality. Key contributions include a draft-and-verify scheme, dynamic-width beam adaptation, and forest-based parallel verification, plus a memory-cost reduction that maintains a single KV cache. Empirically, DSBD achieves substantial speedups and energy savings over traditional beam sampling while delivering higher-quality outputs and maintaining comparable memory footprints, making it practical for real-world LLM inference.

Abstract

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling...
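
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of the standard single-sequence speculative sampling check: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the renormalised residual max(p - q, 0). This is not the paper's multi-beam scheme. The function name `verify_token` and the dictionary representation of the distributions are assumptions made for illustration only.

```python
import random


def verify_token(p_large, q_small, draft_token, rng):
    """Standard speculative-sampling check for one drafted token.

    p_large, q_small: dicts mapping token -> probability (hypothetical
    stand-ins for the large and small models' next-token distributions).
    Returns (token, accepted).
    """
    p = p_large.get(draft_token, 0.0)
    q = q_small[draft_token]  # the draft was sampled from q, so q > 0
    # Accept the draft with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(p - q, 0), renormalised.
    residual = {
        tok: max(p_large[tok] - q_small.get(tok, 0.0), 0.0) for tok in p_large
    }
    total = sum(residual.values())  # positive whenever a rejection occurs
    tokens = list(residual)
    weights = [residual[tok] / total for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


rng = random.Random(0)
# Identical distributions: the draft is always accepted.
same = {"a": 0.5, "b": 0.5}
print(verify_token(same, same, "a", rng))
# Large model gives the draft zero mass: it is rejected and resampled.
print(verify_token({"a": 0.0, "b": 1.0}, {"a": 1.0, "b": 0.0}, "a", rng))
```

The accept-or-resample rule is what makes the output follow the large model's distribution in the single-sequence case; the paper's contribution is extending this guarantee to several beams at once.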

Paper Structure

This paper contains 26 sections, 2 theorems, 19 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Correctness of Draft and Verification Scheme. Let $\mathcal{I}=\{x_{1:t}^{(1)},\cdots,x_{1:t}^{(W_L)}\}$ denote input beams, $\mathcal{S}=\{x_{1:t+1}^{(1)},\cdots,x_{1:t+1}^{(W_S)}\}$ denote draft beams, and $\mathcal{O}=\{\tilde{x}_{1:t+1}^{(1)},\cdots,\tilde{x}_{1:t+1}^{(W_L)}\}$ denote the output

Figures (9)

  • Figure 1: Examples of greedy and beam sampling. Some nodes are omitted in the figures. Assume the sampling probability is warped to always sample the tokens with the largest probabilities. Given the prefix "h", multinomial sampling generates "hello" with an average perplexity of 1.55. Beam sampling generates "happy" with an average perplexity of 1.49.
  • Figure 2: Illustration of one iteration of Speculative Beam Decoding. (a) Draft Stage: given the input beams "who" and "why", the small model first generates a trace of beam sampling. (b)(c) Verification Stage: when verifying the first draft layer, "who are" and "why do" are accepted, while "why is" is rejected. When verifying the second draft layer, "why is it" is directly rejected because its parent is rejected. Then "who are they" is accepted, while "who are it" is rejected, and another beam, "who are you", is sampled from the residual distribution.
  • Figure 3: Illustration of forest-based parallel decoding. Given the draft forest, the large model converts the two trees into sequences in depth-first search order and verifies them in parallel with the topology-aware attention mask. Empty cells in the matrices indicate that attention is masked.
  • Figure 4: Evaluation on SQuAD. For exact match (EM), higher is better. The blue points represent the performance of DSBD under different parameter settings $(\gamma,W_S,t)$. The blue and yellow lines mark the Pareto fronts of DSBD and beam sampling. (SpD: SpecDecode, SI: SpecInfer)
  • Figure 5: Evaluation on Spider. For execution accuracy (EA), higher is better. The blue points represent the performance of DSBD under different parameter settings $(\gamma,W_S,t)$. The blue and yellow lines mark the Pareto fronts of DSBD and beam sampling. (SpD: SpecDecode, SI: SpecInfer)
  • ...and 4 more figures
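
The topology-aware attention mask described in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the forest is assumed to be flattened in depth-first order and described by a hypothetical `parents` list, where `parents[i]` is the index of node `i`'s parent and `-1` marks a tree root. Each token may attend only to itself and its ancestors, so tokens in different trees or on sibling branches never see each other. The example forest loosely reuses the tokens from the Figure 2 caption.

```python
def forest_attention_mask(parents):
    """Build a boolean mask where mask[i][j] is True iff token i may attend
    to token j, i.e. j is i itself or one of i's ancestors.

    parents: parent index per node in depth-first order, -1 for roots
    (an assumed encoding; every parent must precede its children).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, parent in enumerate(parents):
        mask[i][i] = True  # every token attends to itself
        if parent >= 0:
            # Inherit everything the parent can see (its own ancestors).
            for j in range(i):
                if mask[parent][j]:
                    mask[i][j] = True
    return mask


# Illustrative forest in DFS order:
# tree 1: who -> are -> {they, it}; tree 2: why -> {do, is -> it}
tokens = ["who", "are", "they", "it", "why", "do", "is", "it"]
parents = [-1, 0, 1, 1, -1, 4, 4, 6]
mask = forest_attention_mask(parents)
for tok, row in zip(tokens, mask):
    print(f"{tok:>5} " + " ".join("x" if visible else "." for visible in row))
```

In the printed matrix, the dots correspond to the empty (masked) cells mentioned in the caption; a real system would pass an equivalent mask to the large model's attention layers so that all draft branches are scored in one forward pass.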

Theorems & Definitions (4)

  • Theorem 3.1
  • Theorem 3.2
  • proof
  • proof