Table of Contents
Fetching ...

Make Every Draft Count: Hidden State based Speculative Decoding

Yuetao Chen, Xuliang Wang, Xinzhou Zheng, Ming Li, Peng Wang, Hong Xu

TL;DR

A novel system that transforms discarded drafts into reusable tokens is proposed, to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse.

Abstract

Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, resulting in waste of computation. Motivated by the goal of recollecting this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse. To implement such a system, first we introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup against standard speculative decoding.

Make Every Draft Count: Hidden State based Speculative Decoding

TL;DR

A novel system that transforms discarded drafts into reusable tokens is proposed, to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse.

Abstract

Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, resulting in waste of computation. Motivated by the goal of recollecting this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse. To implement such a system, first we introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup against standard speculative decoding.
Paper Structure (24 sections, 6 equations, 13 figures, 2 tables, 2 algorithms)

This paper contains 24 sections, 6 equations, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: Comparing the auto-regressive decoding approach, the token-based speculative decoding approach, and the hidden-states-based speculative decoding approach.
  • Figure 2: For a token-based draft model, an incorrect token implies that all subsequent tokens and hidden states are invalid. In contrast, a hidden-state-based draft model may still retain opportunities for hidden-state reuse.
  • Figure 3: Diagram of our draft model inference timeline.
  • Figure 4: Comparison between iterative and one-pass logits computation.
  • Figure 5: Train loss and accuracy across different model optimizations.
  • ...and 8 more figures