ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

Evangelos Georganas; Dhiraj Kalamkar; Alexander Kozlov; Alexander Heinecke

TL;DR

ML-SpecQD is a plug-and-play acceleration method for LLM inference that combines MXFP4 weight-only quantization with multi-level speculative decoding. An MXFP4-quantized copy of the target model serves as the draft in standard speculative decoding (SD), and a hierarchical two-level drafting strategy, extendable to more levels, further accelerates draft-token generation. A high-performance MXFP4 GEMM microkernel and TPP-based PyTorch extensions enable efficient CPU inference, achieving speedups of up to $2.72\times$ over BF16 baselines while preserving accuracy. Results on QA and code-generation benchmarks indicate that quantized drafts and multi-level speculation are practical for edge and AI-PC environments, with potential for future ultra-low-bit quantization and production deployment.
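To make the MXFP4 weight-only quantization mentioned above concrete, the following is a minimal NumPy sketch of a direct cast to an MXFP4-style layout: blocks of 32 values share one power-of-two scale, and each value is rounded to the FP4 (E2M1) grid. The function names `mxfp4_direct_cast` and `dequantize` are hypothetical, the tie-breaking rule is a simplification, and nothing here reflects the authors' optimized GEMM microkernel; it only illustrates the number format.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (a sign bit is stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32        # elements sharing one scale in MX formats
EMAX_E2M1 = 2     # exponent of the largest E2M1 power of two (4 = 2**2)

def mxfp4_direct_cast(w):
    """Cast weights to MXFP4-style (element, shared power-of-two scale) pairs.

    Returns the FP4 grid values (kept as float32 for readability) and one
    scale per block of 32 elements. Assumes w.size is a multiple of 32.
    """
    w = np.asarray(w, dtype=np.float32)
    assert w.size % BLOCK == 0, "pad the tensor to a multiple of 32 first"
    blocks = w.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)          # avoid log2(0) for all-zero blocks
    scale = np.exp2(np.floor(np.log2(safe)) - EMAX_E2M1)
    scaled = blocks / scale
    mag = np.clip(np.abs(scaled), 0.0, FP4_GRID[-1])
    # Nearest grid point; ties go to the smaller magnitude (a simplification,
    # not round-to-nearest-even).
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q.astype(np.float32), scale.astype(np.float32)

def dequantize(q, scale, shape):
    """Reconstruct an approximation of the original weights."""
    return (q * scale).reshape(shape)

# Usage: quantize a random weight matrix and inspect the reconstruction error.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 64)).astype(np.float32)
q, scale = mxfp4_direct_cast(weights)
approx = dequantize(q, scale, weights.shape)
max_err = float(np.abs(approx - weights).max())
```

Because the cast needs only the existing weights, no calibration data or retraining is involved, which is the property that makes the quantized model usable as a draft without extra alignment work.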

Abstract

Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy relative to 16-bit model inference. In a typical SD setup, a small, fast, full-precision model serves as the "draft" that generates the next few tokens, and the large "target" model verifies the draft-generated tokens. The efficacy of this method depends heavily on the acceptance ratio of the draft-generated tokens and on the relative token throughput of the draft versus the target model. However, an efficient SD pipeline requires pre-training the draft model and aligning it to the target model, which makes SD impractical for plug-and-play LLM inference. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion, since MXFP4 Weight-Only Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups of up to $2\times$ over the BF16 baseline. We then pursue an opportunity for further acceleration: the MXFP4 draft-token generation can itself be accelerated via speculative decoding by using yet another, smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts, since it recursively applies speculation to accelerate draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts, we outperform state-of-the-art speculative decoding, yielding speedups of up to $2.72\times$ over the BF16 baseline.
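The draft-and-verify loop and its recursive, multi-level variant can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses greedy verification only (a sampling-based acceptance rule may be used in practice), the three toy "models" `target`, `quantized_draft`, and `small_draft` are hypothetical stand-ins for the BF16 target, the MXFP4 draft, and a smaller draft, and the per-position calls to the verifier simulate what would be a single batched forward pass in a real system.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]            # greedy next-token function
Drafter = Callable[[List[int], int], List[int]]  # (sequence, k) -> k proposed tokens

def autoregressive(model: Model, seq: List[int], k: int) -> List[int]:
    """Plain greedy decoding: generate k tokens one at a time."""
    out = list(seq)
    for _ in range(k):
        out.append(model(out))
    return out[len(seq):]

def speculative_generate(target: Model, draft: Drafter, prompt: List[int],
                         n_new: int, gamma: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches target-only greedy decoding."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        proposal = draft(seq, min(gamma, goal - len(seq)))
        for tok in proposal:
            expected = target(seq)     # simulated; real systems verify in one pass
            seq.append(expected)       # accepted token, or the target's correction
            if expected != tok:
                break                  # reject the rest of the proposal
        else:
            if len(seq) < goal:
                seq.append(target(seq))  # bonus token when every draft was accepted
    return seq[len(prompt):]

def make_speculative_draft(mid: Model, inner: Drafter, gamma: int) -> Drafter:
    """Multi-level step: the draft is itself produced by speculative decoding."""
    return lambda seq, k: speculative_generate(mid, inner, seq, k, gamma)

# Hypothetical toy models that mostly agree with the target.
def target(seq: List[int]) -> int:
    return (7 * sum(seq) + len(seq)) % 50

def quantized_draft(seq: List[int]) -> int:
    t = target(seq)
    return t if len(seq) % 5 else (t + 1) % 50

def small_draft(seq: List[int]) -> int:
    t = target(seq)
    return t if len(seq) % 3 else (t + 2) % 50

level1: Drafter = lambda seq, k: autoregressive(small_draft, seq, k)
level2 = make_speculative_draft(quantized_draft, level1, gamma=3)
tokens = speculative_generate(target, level2, [1, 2, 3], n_new=20, gamma=4)
```

With greedy verification, every emitted token is one the target would have chosen anyway, so the result is identical to decoding with the target alone; the speedup in a real deployment comes from verifying several tokens per target pass, and it grows with the draft's acceptance ratio and its throughput advantage over the target.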
