Speculative Decoding and Beyond: An In-Depth Survey of Techniques

Yunhai Hu; Zining Liu; Zhenyuan Dong; Tianfan Peng; Bradley McDanel; Sai Qian Zhang

TL;DR

This work tackles the latency of autoregressive decoding in very large models by surveying generation-refinement frameworks, which split generation into a parallelizable drafting phase and a quality-focused refinement phase. It introduces a unified two-phase taxonomy that categorizes draft generation (predefined tokens, retrieval, n-gram, and autoregressive drafts) and refinement (linear versus tree-based verification, and iterative decoding), and it discusses system-level deployment considerations. The analysis covers text, image, speech, and multimodal applications, and highlights practical deployment insights for edge, distributed, and hardware-optimized settings. It also identifies open theoretical and scalability challenges, emphasizing cross-modal generalization and efficient coordination among multiple draft and target models as draft counts rise.

Abstract

Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, image, and speech generation. This examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.
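
To make the two-phase idea concrete, the following is a minimal sketch of greedy draft-then-verify decoding with linear verification. It is not taken from the survey: the function names (`draft_tokens`, `verify_linear`, `generate`) and the toy models (`toy_target`, `toy_draft`) are hypothetical, and both models are assumed to be plain Python functions that return one greedy next token for a given context. A real system would score all drafted positions in a single parallel forward pass of the target model; the sequential loop here only illustrates the acceptance rule.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is a function mapping a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def draft_tokens(draft_next: NextToken, prefix: List[Token], k: int) -> List[Token]:
    """Drafting phase: the cheap model proposes k tokens autoregressively."""
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify_linear(target_next: NextToken, prefix: List[Token],
                  draft: List[Token]) -> List[Token]:
    """Refinement phase: keep the longest draft prefix the target agrees with.

    On the first mismatch the target's own token is emitted instead, so every
    round yields at least one token. If the whole draft is accepted, the
    target contributes one extra token.
    """
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # correct the draft and stop this round
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # extra token after a fully accepted draft
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Alternate drafting and verification until max_new tokens are produced."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = draft_tokens(draft_next, seq, k)
        seq.extend(verify_linear(target_next, seq, draft))
    return seq[:len(prompt) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `generate(toy_target, toy_draft, [2], 8)` produces the same sequence as decoding with `toy_target` alone, because under greedy verification a drafted token is kept only when it matches the target's choice. Any speedup in practice comes from verifying several drafted tokens per target pass, which this sketch does not model.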
