QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

TL;DR

QuantSpec targets efficient long-context LLM inference by marrying self-speculative decoding with a hierarchical 4-bit KV cache and 4-bit weights. It introduces a double full-precision KV cache buffer and a 4-bit hierarchical KV representation to enable fast draft-verification cycles without duplicating full models, achieving high acceptance rates and substantial end-to-end speedups. Kernel-level gains from custom INT4/INT8 KV-attention primitives further bolster performance, with reported speedups approaching $\sim2.5\times$ and memory reductions around $1.3\times$. The approach demonstrates robust improvements across diverse long-context benchmarks, enabling more scalable and latency-efficient deployment of LLMs in edge and long-context settings.

Abstract

Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups because their KV cache optimization strategies are inefficient and lead to low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>90\%$) and provides consistent end-to-end speedups of up to $\sim2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.
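
To make the idea of a hierarchical quantized KV cache concrete, the following is a minimal sketch of one plausible reading: each value is stored as an upper 4-bit code plus a 4-bit residual code, so a draft pass can read only the upper code while a verification pass reads both. The per-group min/max quantizer, the residual layout, and all names (`HierQuant`, `quantize`, `dequant_draft`, `dequant_target`) are assumptions made for illustration, not the authors' kernels or exact scheme.

```python
from dataclasses import dataclass
from typing import List

LEVELS = 15  # a 4-bit code takes values 0..15


@dataclass
class HierQuant:
    upper: List[int]  # coarse 4-bit codes (all the draft pass reads)
    lower: List[int]  # 4-bit residual codes (additionally read when verifying)
    zero: float       # minimum of the quantization group
    scale: float      # step size of the upper code


def quantize(values: List[float]) -> HierQuant:
    """Quantize one group of values into upper + residual 4-bit codes."""
    zero = min(values)
    scale = (max(values) - zero) / LEVELS or 1.0  # avoid a zero step size
    upper, lower = [], []
    for v in values:
        u = min(LEVELS, max(0, round((v - zero) / scale)))
        residual = v - (zero + u * scale)  # lies in [-scale/2, scale/2]
        l = min(LEVELS, max(0, round((residual + scale / 2) / (scale / LEVELS))))
        upper.append(u)
        lower.append(l)
    return HierQuant(upper, lower, zero, scale)


def dequant_draft(q: HierQuant) -> List[float]:
    """Reconstruct from the upper 4 bits only (cheaper, coarser)."""
    return [q.zero + u * q.scale for u in q.upper]


def dequant_target(q: HierQuant) -> List[float]:
    """Reconstruct from both codes (8 bits per value in total)."""
    step = q.scale / LEVELS
    return [q.zero + u * q.scale + l * step - q.scale / 2
            for u, l in zip(q.upper, q.lower)]


if __name__ == "__main__":
    vals = [-1.3, 0.02, 0.7, 2.5, -0.4, 1.1]
    q = quantize(vals)
    print("draft :", [round(x, 3) for x in dequant_draft(q)])
    print("target:", [round(x, 3) for x in dequant_target(q)])
```

In this sketch both passes share the same stored codes, so no second copy of the cache is needed; the draft simply reads half as many bits per value as the verification pass.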

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Throughput in tokens/sec of various decoding methods. QuantSpec achieves $>1.78\times$ speedup over the autoregressive baseline across several context lengths. Benchmarked on LWM-Text-Chat-128k.
  • Figure 2: Breakdown of how arithmetic intensity changes during decoding as the context length and batch size are scaled logarithmically for linear, attention, and aggregate operations. All regimes lie below the ridge plane and thus are memory-bound. The ridge plane is calculated for an NVIDIA A6000 GPU. The colors for the linear and attention surface plots simply represent the magnitude of the arithmetic intensity. The aggregate plot is colored by attention's runtime as a percentage of the total latency. Prefill results are in the appendix.
  • Figure 3: How our Hierarchical KV Cache works in the speculative decoding setting.
  • Figure 4: Speedup ratio of QuantSpec compared to autoregressive baseline as we scale the context length. We report QuantSpec with KV cache-only quantization, weight-only quantization, and both. Benchmarked on Llama-2-7B-32k-Instruct using PG-19.
  • Figure 5: During prefill, all regimes lie above the ridge plane and thus are compute-bound.
  • ...and 4 more figures
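
The memory-bound claim in the Figure 2 caption can be illustrated with a back-of-the-envelope roofline check. The sketch below is not the paper's analysis: it considers a single attention head at batch size 1, counts only the K and V reads, and uses placeholder hardware numbers (`PEAK_FLOPS`, `BANDWIDTH`) that are illustrative assumptions rather than A6000 specifications. All function names are hypothetical.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) above which a kernel is compute-bound."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def decode_attention_intensity(context_len: int, head_dim: int,
                               bytes_per_elem: float) -> float:
    """FLOPs per byte for one decode step of one attention head.

    Assumes one query token: QK^T and the attention-weighted sum over V each
    cost 2 * context_len * head_dim FLOPs, and K and V are each read once.
    """
    flops = 4 * context_len * head_dim
    bytes_moved = 2 * context_len * head_dim * bytes_per_elem
    return flops / bytes_moved


def is_memory_bound(intensity: float, ridge: float) -> bool:
    return intensity < ridge


# Placeholder hardware numbers, chosen only to make the example concrete.
PEAK_FLOPS = 100e12  # FLOP/s (assumption)
BANDWIDTH = 1e12     # bytes/s (assumption)

if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
    for ctx in (4096, 32768, 131072):
        ai = decode_attention_intensity(ctx, head_dim=128, bytes_per_elem=2.0)
        print(ctx, ai, is_memory_bound(ai, ridge))
```

Under these assumptions the intensity does not depend on context length, so decode-time attention stays far below the ridge point however long the context grows. That is why shrinking the bytes read per KV element, as a 4-bit cache does, is a natural lever for latency; the sketch ignores dequantization cost.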