
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen

TL;DR

FSPAD targets the drafting bottleneck in lossless speculative decoding. Built on the EAGLE-2 framework, it strengthens the draft phase with Feature Sampling and reduces training interference with Partial Alignment Distillation. By sampling token-embedding-informed features in a high-dimensional space and decoupling feature and logit learning, FSPAD achieves a higher average number of accepted tokens ($\tau$) and higher speedup ratios ($SR$) across Vicuna and LLaMA3-Instruct models on Spec-Bench tasks, under both greedy and non-greedy decoding. The approach requires only modest extra parameters and training overhead, and it improves consistently over baselines such as EAGLE-2, Medusa, PLD, and Lookahead on multi-turn dialogue, translation, summarization, QA, math reasoning, and retrieval-augmented generation. The results suggest that attention to feature-level representations and training dynamics can improve lossless speculative decoding without changing the output distribution, enabling faster inference for large-scale LLMs in practical settings.

Abstract

Lossless speculative decoding accelerates inference of a target large language model (LLM) by using a lightweight draft model to generate tree-structured candidates, which the target LLM then verifies in parallel. The most effective current approaches perform feature-level rather than token-level autoregression within the draft model, which makes prediction simpler and knowledge distillation more effective. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which adds two simple and effective components to the existing framework. First, because the inherent uncertainty of the features prevents the draft model from knowing which specific token the target LLM output, FSPAD uses token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model. Second, FSPAD introduces partial alignment distillation, which weakens the draft model's connection between features and logits in order to reduce the conflict between feature alignment and logit confidence during training. Our experiments cover both greedy and non-greedy decoding on the largest and smallest models of the Vicuna and LLaMA3-Instruct series, on tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all of these tasks and target LLMs.
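
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not FSPAD or EAGLE-2: it assumes greedy decoding, a single draft chain instead of a candidate tree, and token-level drafting instead of feature-level autoregression. The toy functions `target_next` and `draft_next` are hypothetical stand-ins for the two models. The sketch only shows why the output is unchanged ("lossless") and how the number of tokens accepted per step, whose mean corresponds to $\tau$, arises.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy 'target model' (hypothetical): a deterministic greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'draft model' (hypothetical): agrees with the target except
    when the target's token is a multiple of 5."""
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 50


def autoregressive(next_token: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Plain decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def speculative_greedy(
    target: NextToken, draft: NextToken, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], List[int]]:
    """Greedy speculative decoding with a chain of k drafted tokens.

    Returns the first n generated tokens and the tokens emitted per step.
    """
    out = list(prompt)
    emitted_per_step: List[int] = []
    while len(out) - len(prompt) < n:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # Verification phase: a real system scores all k + 1 positions in
        # one parallel target forward pass; this loop emulates that pass.
        new: List[Token] = []
        for i in range(k + 1):
            t = target(out + proposal[:i])
            new.append(t)  # always emit the target's own token
            if i == k or proposal[i] != t:
                break  # first mismatch (or end of draft) ends the step
        out.extend(new)
        emitted_per_step.append(len(new))
    return out[len(prompt):len(prompt) + n], emitted_per_step


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, steps = speculative_greedy(target_next, draft_next, prompt, n=40)
    assert tokens == autoregressive(target_next, prompt, 40)  # lossless
    print("mean tokens per step:", sum(steps) / len(steps))
```

Every emitted token is the target's own greedy choice, so the output matches plain autoregressive decoding exactly. The draft model only determines how many tokens each verification step yields, which is the quantity a better draft phase aims to raise.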

Paper Structure (18 sections, 2 equations, 6 figures, 3 tables)

This paper contains 18 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The number of tokens generated per step by Vicuna 33B during greedy decoding in tasks of multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. In this paper, we exclusively compare lossless speculative decoding methods to ensure that the distribution of the output text remains unchanged.
  • Figure 2: The challenge of addressing the inherent uncertainty while preserving the regular pattern of the feature sequence. Different token components act on different elements of $p_{Speculative}$. For the feature $f_{Speculative}$, however, the situation is more complex.
  • Figure 3: Accuracy and feature-level loss during the training process, where $w$ represents the coefficient of the logit-level loss, and PAD stands for Partial Alignment Distillation in FSPAD.
  • Figure 4: Overview of draft model based speculative decoding.
  • Figure 5: Schematic representation of the drafting phase for EAGLE-2 and FSPAD. $e$ denotes token embeddings, $f$ signifies the features, and $\eta$ represents the inputs of the draft model, with subscripts indicating their positions in the sequence. The red border indicates the predictions of the draft model used for the next step. The green border indicates the inputs of the draft model for the next step.
  • ...and 1 more figures