Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang
TL;DR
This work tackles the latency of autoregressive decoding in very large models by surveying generation-refinement frameworks, which split generation into a parallelizable drafting phase and a quality-focused refinement phase. It introduces a unified two-phase taxonomy with detailed categorizations of draft generation (predefined tokens, retrieval, n-gram, autoregressive drafts) and refinement (linear vs. tree-based verification, iterative decoding), along with system-level deployment considerations. The analysis covers text, image, speech, and multimodal applications, and highlights practical deployment insights for edge, distributed, and hardware-optimized settings. It identifies open theoretical and scalability challenges, emphasizing cross-modal generalization and efficient coordination among multiple draft and target models as the number of drafts grows.
Abstract
Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, image, and speech generation. By covering both theoretical frameworks and practical implementations, this survey provides a foundation for future research in efficient autoregressive decoding.
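
As a concrete illustration of the generation-refinement pattern described above, the following is a minimal sketch of greedy draft-then-verify decoding. It is not taken from the survey: the function names (`speculative_greedy_decode`, `greedy_decode`, `toy_target`, `toy_draft`) and the toy next-token rules are hypothetical, and the "models" are plain Python functions standing in for real networks. The sketch assumes greedy decoding, under which the accepted output matches what the target model alone would produce.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target_next: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token, strictly sequential."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    """Draft-then-verify loop with linear (single-sequence) verification."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Drafting phase: the cheap model proposes up to draft_len tokens.
        k = min(draft_len, goal - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Refinement phase: the target checks each drafted position.
        # A real system would score all k positions in one batched forward
        # pass; this sketch calls the target per position for clarity.
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every drafted token matched: take one extra target token if room remains.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens


def toy_target(ctx: List[int]) -> int:
    """Toy target rule (assumption for illustration): count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft rule: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `speculative_greedy_decode(toy_target, toy_draft, [0], 12)` returns the same sequence as `greedy_decode(toy_target, [0], 12)`. The draft's errors only reduce how many tokens are accepted per verification round; they do not change the final output.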
