Improving Multi-candidate Speculative Decoding

Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato

TL;DR

This work tackles the efficiency bottleneck in large language model inference by enhancing Multi-Candidate Speculative Decoding (MCSD). It introduces Target Model Initialized Multi-Candidate Generation, a Dynamic Sliced Topology-Aware Causal Mask, and an Early-Stop Decision Model to improve the acceptance rate $\alpha$ and reduce the draft-generation length $\gamma$ while maintaining quality. Empirical results show a maximum speedup of $27.5\%$ in certain configurations and up to $1.90\times$ on MT-Bench, though the combined dynamic approach can incur quality losses or yield only marginal speedups, depending on the draft-target pairing and the dataset. Overall, static target-initialized MCSD with a carefully tuned configuration currently offers the best trade-off between speedup and quality preservation, while the dynamic components provide valuable insights for future optimization.

Abstract

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower-complexity draft model to propose candidate tokens that are then verified by a larger target model. Multi-Candidate Speculative Decoding (MCSD) improves on this by sampling multiple candidate tokens from the draft model at each step and verifying them in parallel, which increases the chance of accepting a token and reduces generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use a static length and a static tree attention structure for draft generation. However, such an approach suffers from differences between the output distributions of the draft and target models, especially in a dynamic generation context. In this work, we introduce a new version of MCSD that includes a target-model-initialized multi-candidate generation, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. We experimented with our method on Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to our MCSD baseline across three benchmarks with Llama 2-7B as the target model and JackFram 68M as the draft model. Additionally, we evaluate the effects of using the target-model-initialized multi-candidate process with different draft models on output quality.
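
To make the verification step in the abstract concrete, the following is a minimal sketch of the standard single-candidate speculative sampling rule: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). This is not necessarily the exact multi-candidate procedure used in the paper. The function names `residual_distribution` and `verify_token`, and the toy distributions in the usage note, are assumptions made for illustration.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so fall back to the target distribution.
        return list(p)
    return [d / total for d in diff]


def verify_token(token, p, q, rng):
    """Verify one drafted token.

    p: target-model probabilities over the vocabulary (assumed given).
    q: draft-model probabilities over the vocabulary (assumed given).
    token: index sampled from q, so q[token] > 0.
    Returns (accepted, output_token).
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    # Rejected: draw a replacement token from the residual distribution.
    residual = residual_distribution(p, q)
    new_token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, new_token
```

For example, with a toy target distribution `p = [0.5, 0.5]` and draft distribution `q = [0.9, 0.1]`, calling `verify_token(1, p, q, random.Random(0))` always accepts token 1, because the target assigns it more mass than the draft does. A drafted token 0 is accepted only part of the time and is otherwise replaced by token 1.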

Paper Structure (18 sections, 4 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Both the draft-initialized (left) and target-initialized (right) multi-candidate generation processes utilize a token tree configuration with a width of 3 and depth of 2. The execution sequence proceeds as follows: (1) Generate the token tree (shown at the top of each diagram). (2) Transform the token tree into a topology-aware causal mask (represented as a square mask with a check symbol). (3) Generate multi-candidate sequences using the draft model (not shown in the figure). (4) Verify the multi-candidate sequences with the target model by obtaining next-token logits, which are then transformed into distributions. (5) Select the candidate sequence with the longest length after verification. (6) Update the input IDs, key-value cache, and sample new token(s) based on the target model's next-token distributions. Note: In a draft-initialized multi-candidate generation, only one new token is sampled, whereas in a target-initialized multi-candidate generation, multiple new tokens are sampled.
  • Figure 2: The dynamic multi-candidate speculative decoding process with an early-stop decision model and a fork-shaped, draft-model-initialized token tree, where the token tree configuration is $W = 3$ (width) and $D = 3$ (depth). $\beta$ denotes the early-stop threshold; in our experiments, $\beta = 0.4$. The inputs to the decision model are the hidden states, or the output distribution and entropy, associated with the token. In this example, generation stops at the second draft generation turn, out of a maximum of three turns.
  • Figure 3: Speedup ratios compared to vanilla inference for different SD methods on all three datasets at temperature = 0.7. We employ the static tree configuration. The bars represent speedup ratios for different model combinations (draft model + target model).
  • Figure 4: Speed heatmap across various static MCSD configurations.
  • Figure 5: Speedup ratios compared to vanilla inference for different SD methods on all three datasets under the dynamic setting at temperature = 1. We employ a dynamic tree configuration with $D = 5$ and $W = 16$. This is our optimal configuration based on empirical studies.
  • ...and 3 more figures
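
Step (2) in the Figure 1 caption, transforming a token tree into a topology-aware causal mask, can be sketched as follows. The sketch assumes the tree is flattened into a parent-index array (a representation chosen here for illustration, not taken from the paper), so that each drafted token may attend only to itself and its ancestors. The function name `tree_attention_mask` and the example tree are hypothetical, the example tree is not claimed to match the exact layout in Figure 1, and attention to the shared prompt prefix is omitted.

```python
def tree_attention_mask(parents):
    """Build a boolean mask where mask[i][j] is True if node i may attend to node j.

    parents[i] is the index of node i's parent in the flattened tree,
    or -1 if node i hangs directly off the already-verified prefix.
    Nodes are assumed to be listed so that every parent precedes its children.
    """
    n = len(parents)
    for i, parent in enumerate(parents):
        if not -1 <= parent < i:
            raise ValueError("each parent must be -1 or precede its child")
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk from node i up to the root, marking itself and every ancestor.
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Illustrative tree: three first-level candidates, each with one child.
example_parents = [-1, -1, -1, 0, 1, 2]
example_mask = tree_attention_mask(example_parents)
```

In this example, node 3 (the child of node 0) attends to nodes 0 and 3 only, so sibling branches stay invisible to each other and all candidate sequences can be verified in a single forward pass of the target model.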