Speculative Decoding and Beyond: An In-Depth Survey of Techniques

Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang

TL;DR

This work tackles the latency of autoregressive decoding in very large models by surveying generation-refinement frameworks, which split generation into a parallelizable drafting phase and a quality-focused refinement phase. It introduces a unified two-phase taxonomy that categorizes draft generation (predefined tokens, retrieval, n-gram, and autoregressive drafts) and refinement (linear versus tree-based verification, and iterative decoding), along with system-level deployment considerations. The analysis covers text, image, speech, and multimodal applications, and highlights practical deployment insights for edge, distributed, and hardware-optimized settings. It identifies open theoretical and scalability challenges, emphasizing cross-modal generalization and efficient coordination among multiple draft and target models as the number of draft models grows.

Abstract

Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, images, and speech generation. This systematic examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.
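
To make the two-phase idea concrete, the following is a minimal sketch of a draft-then-verify loop. It is not taken from the survey. It assumes greedy decoding and uses two toy next-token functions, `toy_target` and `toy_draft`, standing in for the large and small models. All names here (`speculative_decode`, `NextToken`, `k`) are hypothetical, and the methods surveyed typically verify drafts with a probabilistic acceptance rule and a single batched forward pass rather than the per-position calls shown.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], num_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-then-verify loop (illustrative sketch only)."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    while len(seq) < goal:
        # Generation phase: the cheap draft model proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))

        # Refinement phase: keep drafted tokens while they match the target.
        # A real system would score all drafted positions in one forward pass;
        # the target is called per position here only for readability.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: use the target's token
                break
        seq.extend(accepted)
    return seq


def greedy_decode(target: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Plain autoregressive baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(target(seq))
    return seq


def toy_target(seq: List[Token]) -> Token:
    # Toy "large model": counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    # Toy "small model": agrees with the target except after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    out = speculative_decode(toy_target, toy_draft, [0], num_new=12)
    print(out == greedy_decode(toy_target, [0], 12))
```

In this greedy variant every emitted token is one the target itself would have chosen, so the output matches plain greedy decoding with the target; any speed-up comes from how many drafted tokens are accepted per verification step.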

Paper Structure

This paper contains 29 sections, 7 figures.

Figures (7)

  • Figure 1: (a) The Llama architecture consists of stacked transformer decoder blocks. (b) Each decoder block contains a self-attention (SA) block and feedforward (FFN) block. (c) During the decoding stage, tokens are generated auto-regressively.
  • Figure 2: Illustration of speculative decoding workflow.
  • Figure 3: A taxonomy of generation-refinement frameworks, showing two phases: (1) Generation of draft tokens through various methods and (2) Refinement through verification strategies.
  • Figure 4: Taxonomy of Speculative Decoding Algorithms. Symbols indicate implementation approach: $\blacktriangle$ Direct application (no training required), $\bullet$ Full model training from scratch, $\blacksquare$ Model fine-tuning, $\bigstar$ Parameter-efficient fine-tuning (PEFT), $\blacklozenge$ Knowledge distillation from target model.
  • Figure 5: Comparison of speculative decoding approaches: (a) Sequential processing, where the draft model generates tokens (0-3) before the target model verifies them. (b) Parallel processing, where the draft model generates new tokens while the target model simultaneously verifies previous ones.
  • ...and 2 more figures