Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

TL;DR

The paper tackles the challenge of accelerating memory-bound LLM inference by studying the compatibility of speculative decoding and quantization. It shows that applying advanced speculative decoding (EAGLE-2) to 4-bit weight quantized models yields limited gains due to the heavy tree-style verification cost, and it introduces a two-level hierarchical framework (HierSpec) that uses a small intermediate model to bridge drafting and verification. HierSpec delivers a substantial 2.78× speedup on W4A16 Llama-3-70B on an A100, outperforming EAGLE-2 by 1.31×, and demonstrates compatibility with EAGLE-3 checkpoints under favorable alignments. The work provides a practical pathway to jointly exploit speculative decoding and quantization for memory-bound LLM inference, and releases code at https://github.com/AI9Stars/SpecMQuant.

Abstract

Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
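The hierarchical idea in the abstract can be illustrated with a toy sketch. This is only one plausible reading of "turn tree-style drafts into sequence drafts", not the authors' implementation: the names `tree_to_sequence` and `verify_sequence`, the dictionary-based tree format, and the toy greedy models are all hypothetical. A real system would score all draft tokens in a single batched forward pass rather than in a Python loop.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" is a greedy next-token function over a context,
# and a draft tree maps a path (relative to the prefix) to candidate next tokens.
Model = Callable[[Tuple[int, ...]], int]
Tree = Dict[Tuple[int, ...], List[int]]


def tree_to_sequence(prefix: Tuple[int, ...], tree: Tree, mid_model: Model) -> List[int]:
    """Stage 1: the small intermediate model walks the draft tree greedily.

    It follows the tree while its own next token is among the candidates,
    then appends its own token once it leaves the tree. The result is a
    single sequence draft for the target model.
    """
    path: Tuple[int, ...] = ()
    while True:
        tok = mid_model(prefix + path)
        if tok not in tree.get(path, []):
            return list(path) + [tok]
        path += (tok,)


def verify_sequence(prefix: Tuple[int, ...], draft: List[int], target_model: Model) -> List[int]:
    """Stage 2: the target model accepts the longest matching prefix of the draft.

    On the first mismatch it emits its own token instead; if the whole draft
    matches, it emits one extra token. Output equals plain greedy decoding.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_model(prefix + tuple(accepted))
        if expected != tok:
            return accepted + [expected]
        accepted.append(tok)
    return accepted + [target_model(prefix + tuple(accepted))]


# Toy models (assumptions for illustration only).
def target_model(ctx: Tuple[int, ...]) -> int:
    return (ctx[-1] + 1) % 10


def mid_model(ctx: Tuple[int, ...]) -> int:
    # Agrees with the target except after token 3.
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


prefix = (1,)
tree: Tree = {(): [2, 7], (2,): [3], (2, 3): [4, 9], (2, 3, 4): [5]}

sequence_draft = tree_to_sequence(prefix, tree, mid_model)
output = verify_sequence(prefix, sequence_draft, target_model)
print(sequence_draft)  # [2, 3, 9, 0]
print(output)          # [2, 3, 4]
```

In this toy run the target model only ever scores a short sequence, never the whole tree, which is the property the abstract attributes to the hierarchical design. The final output still matches what the target would produce on its own with greedy decoding.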

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of speedup ratios for Llama-3-8B (relative to FP16) and Llama-3-70B (relative to W8A8) under various quantization methods and with EAGLE-2 integration. Solid bars show speedup from quantization alone, dashed bars represent the additional speedup from EAGLE-2, and red arrows indicate the relative speedup achieved by EAGLE-2 across different quantized models.
  • Figure 2: Comparison of average accepted length, verification-to-decoding ratio, and speedup for various quantization precisions (FP16, W8A8, W4A16, W4A8-QQQ, W4A8-QQQ-g128) on Llama-3-8B with EAGLE-2, evaluated on A100 and RTX 3090. Panels (a–c) show A100 results; (d–f) show RTX 3090 results.
  • Figure 3: Comparison of average accepted length, verification-to-decoding ratio, and speedup for various quantization precisions (W8A8, W4A16, W4A8-QQQ, W4A8-QQQ-g128) on Llama-3-70B with EAGLE-2 on A100.
  • Figure 4: Speedup comparison of different EAGLE-2 configurations and vanilla speculative decoding on Llama-3 models. EG-2(6/3, full/half) uses 6 or 3 draft passes with full (60, 48) or half (30, 24) tree sizes; SP(6) denotes vanilla speculative decoding with 6 draft passes.
  • Figure 5: Comparison of drafting time (per draft length) and verification time of three speculative decoding methods applied on W4A16 Llama-3-70B on A100.
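The verification-to-decoding ratio reported in Figures 2 and 3 can be approximated with a simple roofline-style estimate. The sketch below is an assumption-laden illustration, not the paper's measurement method: the function names, the round hardware numbers, and the tree size of 60 tokens are hypothetical, and the model ignores attention, KV-cache traffic, dequantization overhead, and quantization scales.

```python
def forward_time(num_tokens: int, params: float, bytes_per_weight: float,
                 bandwidth: float, flops_per_s: float) -> float:
    """Roofline-style estimate of one forward pass, in seconds.

    Assumes every weight is read once from memory and each weight costs
    about 2 FLOPs per token; the slower of the two bounds the pass.
    """
    memory_time = params * bytes_per_weight / bandwidth
    compute_time = 2 * params * num_tokens / flops_per_s
    return max(memory_time, compute_time)


def verify_to_decode_ratio(tree_size: int, **kwargs: float) -> float:
    """Cost of verifying `tree_size` tokens relative to decoding one token."""
    return forward_time(tree_size, **kwargs) / forward_time(1, **kwargs)


# Hypothetical round numbers: 70B parameters, 2 TB/s memory, 300 TFLOP/s compute.
hw = dict(params=70e9, bandwidth=2.0e12, flops_per_s=300e12)

fp16_ratio = verify_to_decode_ratio(60, bytes_per_weight=2.0, **hw)
w4a16_ratio = verify_to_decode_ratio(60, bytes_per_weight=0.5, **hw)

print(f"FP16  verify/decode ratio: {fp16_ratio:.2f}")   # 1.00
print(f"W4A16 verify/decode ratio: {w4a16_ratio:.2f}")  # 1.60
```

Under these assumptions, 16-bit weights keep the 60-token verification pass memory-bound, so it costs about the same as decoding one token. With 4-bit weights the memory time shrinks fourfold while the compute time does not, so the same verification pass becomes compute-bound and costs more than a single-token pass. This is the qualitative effect the paper reports for tree-style verification on 4-bit weight quantized models.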