QSpec: Speculative Decoding with Complementary Quantization Schemes

Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

TL;DR

QSpec is a training-free quantization paradigm that decouples efficiency from fidelity. It pairs a fast drafting path that uses low-precision joint weight-activation quantization with a verification path that uses high-precision weight-only quantization, and it shares weights and the KV cache across the two phases. Sharing makes switching between phases nearly free, while the high token-level similarity between drafts and final outputs keeps acceptance rates high. The result is up to 1.64x higher throughput than high-precision baselines without fidelity loss, and up to 1.55x higher throughput than prior speculative decoding methods in batched settings. QSpec deploys plug-and-play and generalizes across model scales, quantization schemes, and workloads, making high-fidelity quantized LLM serving more practical under memory constraints. The work also emphasizes the need to evaluate multi-step reasoning tasks in quantization studies and outlines adaptive drafting and hardware-aware optimizations as avenues to broaden applicability.

Abstract

Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers from substantial performance degradation on multi-step reasoning tasks. We propose QSpec, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models. Compared to high-precision baselines, QSpec achieves up to 1.64x speedup without quality degradation, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSpec supports plug-and-play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSpec a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios. Our code is available at https://github.com/hku-netexplo-lab/QSpec.
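The draft-then-verify idea described in the abstract can be illustrated with a minimal sketch, assuming greedy decoding. This is not the authors' implementation: `speculative_decode`, `greedy_decode`, `toy_draft`, and `toy_verify` are hypothetical names, the two toy functions merely stand in for a W4A4 drafting path and a W4A16 verification path, and weight sharing and KV cache overwriting are not modeled. In a real system the verification step would score all draft positions in a single forward pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
NextToken = Callable[[List[int]], int]


def greedy_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding with a single model."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(draft_next: NextToken, verify_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       gamma: int = 3) -> List[int]:
    """Draft `gamma` tokens with the cheap path, then keep the longest prefix
    that the accurate path agrees with (hypothetical helper, greedy matching)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft phase: the low-precision path proposes gamma tokens.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))

        # Verify phase: the high-precision path checks each proposal in order.
        accepted: List[int] = []
        for tok in draft:
            expected = verify_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token accepted
            else:
                accepted.append(expected)  # rejected: use the verifier's token
                break
        else:
            # Every draft token was accepted: the verifier adds one more token.
            accepted.append(verify_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]


def toy_verify(ctx: List[int]) -> int:
    """Stand-in for the high-precision path (assumption: arbitrary toy rule)."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    """Stand-in for the low-precision path: agrees with toy_verify most of the time."""
    tok = toy_verify(ctx)
    return (tok + 1) % 11 if len(ctx) % 4 == 0 else tok


if __name__ == "__main__":
    print(speculative_decode(toy_draft, toy_verify, [1, 2], 10, gamma=3))
```

Because every emitted token is either confirmed or produced by the verification path, the output of this sketch matches what `greedy_decode` would produce with the verifier alone; the drafting path only affects how many verifier steps are needed.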

Paper Structure

This paper contains 37 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Diagrams of different 4-bit quantization schemes. W4A16: uses 4-bit weights and 16-bit activations for inference. W4A4: additionally adopts 4-bit activations to utilize low-precision W4A4 kernels. QSpec: accelerates W4A16 by drafting tokens with W4A4 and verifying them with W4A16, and applies KV cache overwriting to keep memory consumption consistent.
  • Figure 2: Scatter plot of token prediction probabilities for Atom-based W4A4 and W4A16 on the GSM8K test set, along with their two-dimensional and marginal probability distributions. The two quantization schemes are strikingly similar, which lays the foundation for QSpec.
  • Figure 3: A mini-sample of QSpec, where yellow, red, and blue tokens represent W4A4 draft tokens, rejected tokens, and tokens generated directly by W4A16, respectively. Green tokens are draft tokens that have been verified and accepted by W4A16.
  • Figure 4: Per-valid-token latency decomposition for different methods. The latency of QSpec is further broken down into draft and verify components.
  • Figure 5: Acceptance rate and throughput of Llama3.2-3b (batch size 8) and Llama3-8b-instruct (batch size 16) with respect to the draft token length $\gamma$.
  • ...and 2 more figures
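
The trade-off in Figure 5 between acceptance rate and draft length $\gamma$ can be made concrete with a rough back-of-the-envelope model. The sketch below is not taken from the paper: it assumes each draft token is accepted independently with the same probability `alpha`, in which case a draft-verify cycle yields $(1 - \alpha^{\gamma+1}) / (1 - \alpha)$ tokens in expectation, a standard result for speculative decoding. The function name `expected_tokens_per_cycle` is hypothetical.

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-verify cycle, assuming i.i.d. acceptance.

    alpha: per-token acceptance probability (assumption: constant and independent).
    gamma: number of draft tokens proposed per cycle.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # All gamma drafts are accepted, plus one token from the verifier.
        return float(gamma + 1)
    # Geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for gamma in (1, 2, 4, 8):
        print(gamma, expected_tokens_per_cycle(0.9, gamma))
```

Under this simplified model, the expected yield grows with $\gamma$ but saturates at $1 / (1 - \alpha)$, while drafting cost keeps growing linearly, which is one way to see why throughput need not increase monotonically with draft length.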