Steering Pretrained Drafters during Speculative Decoding

Frédéric Berdoz, Peer Rheinboldt, Roger Wattenhofer

TL;DR

This work targets latency in autoregressive LLM inference by improving speculative decoding with a pretrained drafter. It introduces SD2, a lightweight dynamic steering mechanism that computes a steering vector from the verifier's hidden states and injects it into the drafter, raising token acceptance while keeping overhead negligible. Training uses synthetic verifier data and a KL-divergence objective to align the drafter with the verifier, while the verifier remains fixed. Compared to distillation, SD2 increases the number of accepted tokens by up to 35% under standard sampling and up to 22% under greedy sampling across several verifier-drafter pairs and tasks. SD2 can be retrofitted to existing pretrained models, preserves drafting performance on long sequences, and is robust to distribution shifts, making it a practical upgrade for existing speculative decoding pipelines.

Abstract

Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
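
To make "token acceptance" concrete, the following is a minimal sketch of how accepted draft tokens might be counted within one block. It assumes the standard speculative sampling rule (accept a drafted token with probability min(1, p_V / p_D)) and, for greedy decoding, acceptance while the drafted token matches the verifier's most likely token. The abstract does not spell out either rule, and the function names and dictionary-based distributions are hypothetical choices for illustration only.

```python
import random


def count_accepted(draft_tokens, draft_probs, verifier_probs, rng):
    """Count the leading draft tokens accepted under sampling (T=1).

    draft_probs[i] and verifier_probs[i] map token -> probability at
    draft position i. Assumes the standard speculative sampling rule;
    resampling the rejected position from the residual distribution
    is omitted from this sketch.
    """
    accepted = 0
    for tok, p_d, p_v in zip(draft_tokens, draft_probs, verifier_probs):
        ratio = p_v.get(tok, 0.0) / p_d[tok]
        if rng.random() < min(1.0, ratio):
            accepted += 1
        else:
            break  # the first rejection ends the block
    return accepted


def count_accepted_greedy(draft_tokens, verifier_probs):
    """Count the leading draft tokens accepted under greedy decoding (T=0)."""
    accepted = 0
    for tok, p_v in zip(draft_tokens, verifier_probs):
        if tok != max(p_v, key=p_v.get):
            break  # the drafter disagreed with the verifier's argmax
        accepted += 1
    return accepted
```

In this sketch, a drafter whose distribution matches the verifier's exactly has every token accepted, which is why better drafter-verifier alignment translates into more accepted tokens per block.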

Paper Structure

This paper contains 34 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of different drafting paradigms: Independent drafting uses a smaller model from the same family as the verifier, with no access to its internal state. Dependent drafting (e.g., EAGLE-3) uses lightweight heads trained to read the verifier’s hidden states, sharing input embeddings and using concatenated features for guidance. SD2 strikes a middle ground, leveraging verifier features for steering while retaining the generalization capabilities of independent drafters.
  • Figure 2: The steering mechanism in SD2 works by concatenating the verifier’s high-, medium-, and low-level hidden features and passing them through a linear projection to produce a steering vector. This vector is transformed by another linear layer into a set of biases, which are added to all MLP hidden states in the drafter just before the activation function, as detailed in the paper's two equations (before and after steering).
  • Figure 3: The training process of SD2 aligns the drafter's ($\pi_D$) probability distribution to the verifier's ($\pi_V$). To achieve this, we randomly choose an offset $\delta \in [1,k]$ to simulate drafting the $\delta$'th token of a block. After extracting $g$ from the verifier's activations, we compute $\pi_D(x_t | x_{1:t-1}, g_{t-\delta})$ and use the Kullback-Leibler divergence $D_{\text{KL}}(\pi_V(\cdot |x_{1:t-1}) \Vert \pi_D(\cdot |x_{1:t-1}, g_{t-\delta}))$ as the loss. In addition to $W_{s}$ (see the overview figure), both $W_{hml}$ and $\pi_D$ are trained. The verifier $\pi_V$ stays frozen throughout training.
  • Figure 4: Number of tokens accepted per block at different positions. We compare how different drafter/verifier pairs fare at different positions throughout the generation process: a point at position $x$ shows the average number of accepted tokens per block over blocks whose last generated token lies at position $x\pm8$. Large pretrained drafters can leverage their vast training data to maintain strong drafting performance as sequence length increases. SD2 minimally interferes with this behavior.
  • Figure 5: The average number of accepted tokens per block for the different speculative decoding setups (left to right: EAGLE-3*, Pretrained, Distilled, SD2), averaged across all tasks. Solid bars correspond to $T=1$ (sampling) and hashed bars to $T=0$ (greedy). SD2 consistently achieves higher acceptance than both the Distilled and the Pretrained drafter. For Vicuna 1.3, the drafter (Llama 160M) has fewer than half as many active parameters as the corresponding EAGLE-3* model. At such small sizes, pretrained drafters lose their competitiveness to dependent heads; however, SD2 can bridge this gap.
  • ...and 2 more figures
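
The steering mechanism (Figure 2) and the training loss (Figure 3) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that $W_{hml}$ is the projection producing the steering vector $g$ and that $W_{s}$ maps $g$ to the biases, it uses a plain two-layer MLP with ReLU in place of the drafter's actual MLP block, and all function names and shapes are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return max(0.0, v)


def steering_vector(h_high, h_mid, h_low, W_hml):
    """Concatenate verifier features and project them to a steering vector g.

    Assumption: W_hml plays the role of the first linear projection.
    """
    return matvec(W_hml, h_high + h_mid + h_low)


def steered_mlp(x, W_up, W_down, W_s, g, act=relu):
    """Drafter MLP with a steering bias added just before the activation.

    Assumption: a plain two-layer MLP stands in for the drafter's MLP,
    and W_s maps the steering vector g to one bias per hidden unit.
    """
    pre_activation = matvec(W_up, x)
    bias = matvec(W_s, g)
    hidden = [act(p + b) for p, b in zip(pre_activation, bias)]
    return matvec(W_down, hidden)


def kl_divergence(p_verifier, p_drafter, eps=1e-12):
    """D_KL(pi_V || pi_D) for two discrete next-token distributions."""
    return sum(
        pv * math.log((pv + eps) / (pd + eps))
        for pv, pd in zip(p_verifier, p_drafter)
        if pv > 0
    )
```

With a zero steering vector the bias vanishes and `steered_mlp` reduces to the unsteered MLP, which is consistent with the caption's description of steering as an additive bias. During training, the loss for one position would be `kl_divergence` between the frozen verifier's distribution and the steered drafter's distribution.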