ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke

TL;DR

ML-SpecQD is a plug-and-play acceleration for LLM inference that combines MXFP4 weight-only quantization with multi-level speculative decoding. An MXFP4 direct-cast copy of the target model serves as the draft in standard SD, and a second, smaller draft accelerates the MXFP4 draft's own token generation; the two-level hierarchy can be extended to more levels. A high-performance MXFP4 GEMM microkernel and TPP-based PyTorch extensions enable efficient CPU inference, achieving speedups of up to $2.72\times$ over BF16 baselines while preserving accuracy. Results on QA and code-generation benchmarks indicate that quantized drafts and multi-level speculation are practical for edge and AI-PC environments, with potential for extension to ultra-low-bit quantization and production deployment.

Abstract

Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy relative to 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as a "draft" to generate the next few tokens and to use the large "target" model to verify the draft-generated tokens. The efficacy of this method relies heavily on the acceptance ratio of the draft-generated tokens and on the relative token throughput of the draft versus the target model. However, an efficient SD pipeline requires pre-training the draft model and aligning it to the target model, which makes it impractical for plug-and-play LLM inference. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion, since MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups of up to 2x over the BF16 baseline. We then pursue an opportunity for further acceleration: the MXFP4 draft-token generation can itself be accelerated via speculative decoding by using yet another, smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts, since it recursively applies speculation to accelerate the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts, we outperform state-of-the-art speculative decoding, yielding speedups of up to 2.72x over the BF16 baseline.
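
To make the "direct-cast" step in the abstract concrete, the following is a minimal sketch of block-wise MXFP4 quantization. It assumes the usual Microscaling conventions (blocks of 32 elements, one shared power-of-two scale per block, FP4 E2M1 elements with magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). The function names are hypothetical, plain Python floats stand in for BF16 weights, ties are broken toward the smaller magnitude for simplicity, and no bit packing is done; the authors' actual kernel and storage layout are not shown in this section.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale (assumed MX block size)
EMAX_ELEM = 2     # exponent of the largest E2M1 value: 6 = 1.5 * 2**2


def quantize_block(block):
    """Direct-cast one block to (shared_exponent, list of FP4 values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - EMAX_ELEM
    # Keep the exponent in the range of an 8-bit power-of-two scale.
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_VALUES[-1])  # clamp to largest value
        nearest = min(FP4_E2M1_VALUES, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate weights from one quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def direct_cast(weights):
    """Quantize a flat list of weights block by block, with no calibration."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


if __name__ == "__main__":
    weights = [0.013 * ((i * 7) % 11 - 5) for i in range(64)]
    for shared_exp, q in direct_cast(weights):
        approx = dequantize_block(shared_exp, q)
        print(shared_exp, approx[:4])
```

Because the cast needs only the weights themselves, with no calibration data or retraining, the draft can be derived from any BF16 target model, which is what makes the approach plug-and-play.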

Paper Structure

This paper contains 19 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Various flavors of Speculative Decoding: (a) Speculative decoding with a large target model (e.g. Llama 7B in BF16) and a custom, small draft model (e.g. Tiny Llama 68M), (b) Speculative decoding with a large target model and an MXFP4-direct-cast-quantized target model as draft model, (c) Multi-Level Speculative Decoding: a large target model uses an MXFP4-direct-cast-quantized target model as draft model, and subsequently the MXFP4-direct-cast-quantized model uses a smaller draft, potentially also MXFP4-quantized.
  • Figure 2: Expected speculative decoding speedup for different draft-token acceptance ratios, for 3 different drafts that are $4\times$, $20\times$ and $100\times$ faster (smaller) than the target model.
  • Figure 3: Expected speedup in a 2-level speculative decoding setup, with a large BF16 target model, a 4-bit quantized model (4$\times$ faster than target) as intermediate draft, and a small model (100$\times$ faster than target) as a last-level draft.
  • Figure 4: AVX2 GEMM microkernel with MXFP4 weights and vnni-INT8 FMAs ($M=8$, $N=1$, $K=32$).
  • Figure 5: QA benchmark: Acceptance ratios (%) of draft tokens in two speculative decoding setups: (a) MXFP4-direct-cast-quantized draft (blue bars) and (b) 68M custom Llama draft (orange bars) over 100 input prompts.
  • ...and 3 more figures
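
The trade-off described in the captions of Figures 2 and 3 (acceptance ratio versus relative draft speed) can be explored with a small sketch. It uses the standard analytical model from the speculative decoding literature, which assumes each draft token is accepted independently with probability `alpha`: with `gamma` draft tokens per step and a draft that costs a fraction `c` of a target forward pass, the expected speedup is $(1-\alpha^{\gamma+1}) / ((1-\alpha)(\gamma c + 1))$. Whether the paper's own equations take exactly this form is an assumption, and the two-level composition below (treating the accelerated intermediate draft as a cheaper draft) is a simplification introduced here for illustration. All function names are hypothetical.

```python
def expected_speedup(alpha, gamma, c):
    """Expected speedup of one-level speculative decoding.

    alpha: probability that a draft token is accepted (0..1)
    gamma: number of draft tokens proposed per verification step
    c:     cost of one draft step relative to one target step
           (a draft that is 4x faster has c = 0.25)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        expected_tokens = gamma + 1          # every draft token is accepted
    else:
        expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_tokens / (gamma * c + 1)


def best_speedup(alpha, c, max_gamma=32):
    """Return (speedup, gamma) for the best draft length up to max_gamma."""
    return max((expected_speedup(alpha, g, c), g) for g in range(max_gamma + 1))


def two_level_speedup(alpha_top, gamma_top, c_mid, alpha_low, gamma_low, c_low):
    """Simplified two-level estimate (an assumption, not the paper's formula).

    The intermediate draft (relative cost c_mid) is itself sped up by a
    last-level draft (relative cost c_low, both relative to the target),
    which lowers its effective per-token cost seen by the target.
    """
    inner = expected_speedup(alpha_low, gamma_low, c_low / c_mid)
    return expected_speedup(alpha_top, gamma_top, c_mid / inner)


if __name__ == "__main__":
    # Drafts that are 4x, 20x and 100x faster than the target, as in Figure 2.
    for c in (0.25, 0.05, 0.01):
        for alpha in (0.5, 0.7, 0.9):
            speedup, gamma = best_speedup(alpha, c)
            print(f"c={c:<5} alpha={alpha}  best gamma={gamma:<2} speedup={speedup:.2f}")
```

Under this model, a draft that is only 4x faster than the target needs a high acceptance ratio to pay off, while a much faster draft tolerates a lower one.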