Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

TL;DR

This work introduces Draft-based Approximate Inference, a lookahead-based framework that uses a lightweight draft model to predict future outputs and refine token/KV importance estimates, enabling more accurate yet memory- and compute-efficient long-context inference. It yields two main methods, SpecKV for KV cache dropping and SpecPC for prompt compression, plus a cascaded SpecKV-PC pipeline that combines both approaches for superior performance. Theoretical analyses bound the impact of draft quality on importance estimates and attention activations, while empirical results on RULER and LongBench across multiple model families show consistent accuracy gains with fixed memory/compute budgets, including substantial latency and memory savings at 64K contexts. The work demonstrates the practical viability of draft-model lookahead to enhance long-context LLM inference, with broad implications for scalable deployment and future extensions in sparse decoding and iterative KV management.

Abstract

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.

Paper Structure

This paper contains 43 sections, 5 theorems, 41 equations, 20 figures, 13 tables, 2 algorithms.

Key Result

Theorem 1

If $\|x_i^{(o)} - \hat{x}_i^{(o)}\|_2 \le \epsilon$ for all $i$ and $\|x_j\|_2 \le \sqrt d$ for all $j$, then $\|s - \hat{s}\|_2 \le \epsilon \|W_q W_k^T\|_2$.
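
The bound can be sanity-checked numerically. This excerpt does not define $s$, so the sketch below makes an assumption: $s$ is the softmax attention from output-token queries $x_i^{(o)}$ to prompt keys $x_j$, with $1/\sqrt{d}$ scaling, averaged over the queries. To stay dependency-free it also takes $W_q W_k^T$ to be diagonal, so that its spectral norm is simply the largest absolute diagonal entry. The names `importance`, `rescale`, and `check_bound` are hypothetical helpers, not from the paper.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def importance(queries, keys, w_diag):
    """ASSUMED definition of s: softmax attention over keys, averaged over queries."""
    d = len(w_diag)
    s = [0.0] * len(keys)
    for q in queries:
        # Logit q^T W x / sqrt(d), with W = W_q W_k^T assumed diagonal.
        logits = [
            sum(q[k] * w_diag[k] * x[k] for k in range(d)) / math.sqrt(d)
            for x in keys
        ]
        for j, p in enumerate(softmax(logits)):
            s[j] += p / len(queries)
    return s


def rescale(v, norm):
    """Rescale v so that its Euclidean norm equals `norm`."""
    length = math.sqrt(sum(c * c for c in v))
    return [c * norm / length for c in v]


def check_bound(seed=0, d=8, n_keys=16, n_out=4, eps=0.5):
    """Return (||s - s_hat||_2, eps * ||W_q W_k^T||_2) for one random instance."""
    rng = random.Random(seed)

    def gauss_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    w_diag = gauss_vec()  # diagonal of W_q W_k^T (toy choice)
    keys = [rescale(gauss_vec(), math.sqrt(d)) for _ in range(n_keys)]  # ||x_j||_2 = sqrt(d)
    true_out = [gauss_vec() for _ in range(n_out)]  # x_i^(o)
    draft_out = [
        [a + b for a, b in zip(x, rescale(gauss_vec(), eps))]  # ||x - x_hat||_2 = eps
        for x in true_out
    ]
    s = importance(true_out, keys, w_diag)
    s_hat = importance(draft_out, keys, w_diag)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(s, s_hat)))
    bound = eps * max(abs(w) for w in w_diag)  # spectral norm of a diagonal matrix
    return gap, bound


gap, bound = check_bound()
print(f"||s - s_hat||_2 = {gap:.4f} <= bound = {bound:.4f}")
```

Under these assumptions the observed gap should never exceed the bound, for any seed. In practice it is often much smaller, since the bound covers the worst case.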

Figures (20)

  • Figure 1: (a) Overview of our Draft-based Approximate Inference framework for input token importance estimation in comparison with current approaches and the oracle approach. Prior methods use input tokens to estimate input token importance. Our approach incorporates draft model predictions of future output tokens, yielding more accurate importance estimates. This better aligns with the hypothetical oracle setting, where the true output is known and influential tokens can be precisely identified. (b) On RULER with Llama-3-70B [grattafiori2024llama], lookahead-based methods (LAQ++ [laq], SpecKV) significantly outperform non-lookahead approaches (H2O [h2o], SnapKV [snapkv], PyramidKV [pyramidkv]), with our proposed SpecKV achieving the best overall downstream score.
  • Figure 2: Experimental validation on RULER-32K tasks (5 samples each) using Qwen2.5 models. (a) Lower error $\epsilon$ yields higher downstream scores. Increasing the draft model size (SpecKV) or initial cache size (LAQ++) reduces $\epsilon$, with SpecKV outperforming LAQ++. (b) Importance scores (as used in SpecPC) of the draft and target models are highly correlated. (c) For SpecPC, a larger draft model improves both the token importance correlation ($R^2$) and the final task performance.
  • Figure 3: Overview of SpecKV: Instead of using only the last prompt tokens like SnapKV, SpecKV employs a lightweight draft model to generate lookahead tokens, providing richer context for more accurate KV importance estimation. Tokens in the window are always retained.
  • Figure 4: Overview of cascaded compression with SpecKV-PC: First, the draft model produces token importance scores and lookahead tokens. Next, SpecPC uses these scores to compress the initial input prompt. Finally, the target model is prefilled using both the compressed prompt and the lookahead tokens, while SpecKV compresses its KV cache.
  • Figure 5: Performance of SpecKV and SpecPC. Both methods consistently outperform all baselines across sequence lengths, maintaining strong results at longer contexts. SpecKV-PC further improves upon SpecKV to achieve state-of-the-art results for KV dropping. Note that H2O and PyramidKV are not plotted for Qwen2.5 32B as their performance falls outside the visible range.
  • ...and 15 more figures
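
The KV selection step described for SpecKV in Figure 3 can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a single attention head, takes the attention weights from the draft-generated lookahead queries to the prompt positions as given, aggregates them with a maximum over queries (an assumed choice), and treats the "window" as the last `window` prompt positions. The function name `select_kv` and its parameters are hypothetical.

```python
def select_kv(attn, budget, window):
    """Pick which prompt KV positions to keep.

    attn[i][j] is the attention weight from lookahead query i to prompt
    position j. Returns the sorted list of kept positions.
    """
    n = len(attn[0])
    # Positions in the window are always retained.
    window_idx = set(range(max(0, n - window), n))
    # Aggregate importance over lookahead queries (max is an assumption).
    scores = [max(row[j] for row in attn) for j in range(n)]
    # Rank the remaining positions by importance, highest first.
    candidates = [j for j in range(n) if j not in window_idx]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    extra = max(0, budget - len(window_idx))
    return sorted(window_idx | set(candidates[:extra]))


# Two lookahead queries attending over five prompt positions.
attn = [
    [0.05, 0.60, 0.05, 0.10, 0.20],
    [0.10, 0.10, 0.50, 0.10, 0.20],
]
print(select_kv(attn, budget=3, window=1))  # [1, 2, 4]
```

With a budget of three and a window of one, the last position is kept unconditionally and the two remaining slots go to positions 1 and 2, which receive the highest attention from the lookahead queries.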

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof
  • proof
  • Theorem 3
  • proof