Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
TL;DR
The paper studies whether speculative decoding and quantization can be combined to accelerate memory-bound LLM inference. It shows that applying an advanced speculative decoding method (EAGLE-2) to 4-bit weight-quantized models yields limited gains, because verifying a tree-style draft is computationally heavy. To address this, it introduces a two-level hierarchical framework (HierSpec) that uses a small intermediate model to bridge drafting and verification. HierSpec delivers a 2.78× speedup on W4A16 Llama-3-70B on an A100 GPU, outperforming EAGLE-2 by 1.31×, and demonstrates compatibility with EAGLE-3 checkpoints under favorable alignments. The work offers a practical way to exploit speculative decoding and quantization together, and the code is released at https://github.com/AI9Stars/SpecMQuant.
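To make the "heavy tree-style verification cost" concrete, the following is a minimal roofline-style sketch and not the paper's measurement method. It assumes a forward pass takes the larger of the time to read the weights and the time to do the matrix multiplications, that W4A16 still computes in FP16, and that the hardware figures are approximate A100 peak numbers. The function names and the 60-token tree size are hypothetical, chosen only for illustration.

```python
# Idealized roofline estimate. All constants are approximate and the model
# ignores KV-cache traffic, dequantization overhead, and kernel efficiency.
A100_BANDWIDTH = 2.0e12    # bytes/s, approximate HBM bandwidth (assumption)
A100_FP16_FLOPS = 312e12   # FLOP/s, peak dense FP16 tensor-core rate (assumption)


def forward_latency(n_params: float, bytes_per_weight: float, n_tokens: int) -> float:
    """Estimated seconds for one forward pass over n_tokens positions."""
    memory_time = n_params * bytes_per_weight / A100_BANDWIDTH
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    compute_time = 2 * n_params * n_tokens / A100_FP16_FLOPS
    return max(memory_time, compute_time)


def verification_slowdown(bytes_per_weight: float, n_tokens: int,
                          n_params: float = 70e9) -> float:
    """Cost of verifying n_tokens at once, relative to a single-token pass."""
    single = forward_latency(n_params, bytes_per_weight, 1)
    return forward_latency(n_params, bytes_per_weight, n_tokens) / single


if __name__ == "__main__":
    for label, width in (("FP16 weights", 2.0), ("4-bit weights", 0.5)):
        tree = verification_slowdown(width, 60)  # hypothetical 60-token tree draft
        seq = verification_slowdown(width, 6)    # hypothetical 6-token sequence draft
        print(f"{label}: tree x{tree:.2f}, sequence x{seq:.2f}")
```

Under these assumptions, a 60-token tree costs about the same as a single token with FP16 weights but roughly 1.5 times as much with 4-bit weights, while a short sequence draft stays memory-bound in both cases. Real kernels will differ from this idealized model.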
Abstract
Speculative decoding and quantization both effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, at the cost of additional computation. Quantization relieves the same bottleneck by compressing weights and activations into lower bit-widths, and it also reduces computation via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits of 4-bit weight quantization are diminished by the computational load of speculative decoding. Specifically, on 4-bit weight-quantized models, verifying a tree-style draft takes significantly more time than a single-token forward pass. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, preserving the memory access benefits of the quantized target model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code is available at https://github.com/AI9Stars/SpecMQuant.
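The hierarchical idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the three "models" are hypothetical deterministic next-token functions, decoding is greedy, and each verification is written as a loop where a real system would use one batched forward pass. All names (`make_toy_model`, `draft_tree`, `verify_tree`, `hierarchical_step`) are made up for this example.

```python
from typing import Callable, List

VOCAB = 50
Model = Callable[[List[int]], int]  # toy stand-in: prefix -> greedy next token


def make_toy_model(noise_every: int) -> Model:
    """Hypothetical model; noise_every > 0 makes it disagree with the target sometimes."""
    def next_token(prefix: List[int]) -> int:
        tok = (sum(prefix) * 31 + len(prefix)) % VOCAB
        if noise_every and len(prefix) % noise_every == 0:
            tok = (tok + 1) % VOCAB
        return tok
    return next_token


def verify_sequence(model: Model, prefix: List[int], draft: List[int]) -> List[int]:
    """Accept the longest draft prefix matching the model, then add one model token."""
    accepted: List[int] = []
    for tok in draft:
        expected = model(prefix + accepted)
        if tok != expected:
            return accepted + [expected]  # correction token replaces the mismatch
        accepted.append(tok)
    return accepted + [model(prefix + accepted)]  # bonus token after full acceptance


def draft_tree(tiny: Model, prefix: List[int], depth: int) -> List[List[int]]:
    """Tiny drafter proposes a two-branch tree, returned as root-to-leaf paths."""
    first = tiny(prefix)
    paths = []
    for root in (first, (first + 1) % VOCAB):
        path = [root]
        while len(path) < depth:
            path.append(tiny(prefix + path))
        paths.append(path)
    return paths


def verify_tree(model: Model, prefix: List[int], paths: List[List[int]]) -> List[int]:
    """Keep the branch with the longest accepted run (tree draft -> sequence)."""
    return max((verify_sequence(model, prefix, p) for p in paths), key=len)


def hierarchical_step(tiny: Model, mid: Model, target: Model,
                      prefix: List[int], seq_len: int, depth: int) -> List[int]:
    """Intermediate model absorbs the tree drafts; target sees only a sequence."""
    seq: List[int] = []
    while len(seq) < seq_len:
        ctx = prefix + seq
        seq += verify_tree(mid, ctx, draft_tree(tiny, ctx, depth))
    return verify_sequence(target, prefix, seq[:seq_len])


def hierarchical_generate(tiny: Model, mid: Model, target: Model,
                          prefix: List[int], n: int,
                          seq_len: int = 4, depth: int = 3) -> List[int]:
    out: List[int] = []
    while len(out) < n:
        out += hierarchical_step(tiny, mid, target, list(prefix) + out, seq_len, depth)
    return out[:n]


def greedy(model: Model, prefix: List[int], n: int) -> List[int]:
    out = list(prefix)
    for _ in range(n):
        out.append(model(out))
    return out[len(prefix):]


if __name__ == "__main__":
    tiny, mid, target = make_toy_model(3), make_toy_model(7), make_toy_model(0)
    print(hierarchical_generate(tiny, mid, target, [1, 2, 3], 20))
```

In this sketch the target model only ever verifies short sequence drafts, which mirrors the motivation stated above; the tree-style work is absorbed by the intermediate model. Because every emitted token is checked against the target's greedy choice, the output matches plain greedy decoding with the target.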
