Draft-based Approximate Inference for LLMs

Kevin Galim; Ethan Ewer; Wonjun Kang; Minjae Lee; Hyung Il Koo; Kangwook Lee

TL;DR

This work introduces Draft-based Approximate Inference, a lookahead-based framework in which a lightweight draft model predicts future outputs, and those predictions are used to refine token and KV pair importance estimates. The result is long-context inference that is more accurate at a given memory and compute cost. The framework yields two main methods, SpecKV for KV cache dropping and SpecPC for prompt compression, plus SpecKV-PC, a cascaded pipeline that applies both. Theoretical analyses bound the impact of draft quality on importance estimates and attention activations. Empirical results on RULER and LongBench across multiple model families show consistent accuracy gains under fixed memory and compute budgets, along with substantial latency and memory savings at 64K contexts. These results indicate that draft-model lookahead is practical for long-context LLM inference and point to future extensions in sparse decoding and iterative KV management.

Abstract

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
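
To make the idea of lookahead-based importance estimation concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. It assumes that the queries of the draft-predicted future tokens are already available as an array (`lookahead_queries`), scores each prompt position by the attention it receives from those queries, and keeps only the top `budget` KV pairs. The function names, the max-over-queries aggregation, and the toy data are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lookahead_importance(prompt_keys, lookahead_queries):
    """Score each prompt position by the largest attention weight it
    receives from any lookahead query (aggregation is an assumption)."""
    d = prompt_keys.shape[-1]
    # scores has shape (n_lookahead, n_prompt)
    scores = lookahead_queries @ prompt_keys.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    return attn.max(axis=0)


def select_kv(prompt_keys, prompt_values, lookahead_queries, budget):
    """Keep the `budget` most important KV pairs, in original order."""
    importance = lookahead_importance(prompt_keys, lookahead_queries)
    keep = np.sort(np.argsort(importance)[-budget:])
    return prompt_keys[keep], prompt_values[keep], keep


# Toy example: 8 prompt tokens with orthogonal keys (hypothetical data).
keys = np.eye(8)
values = np.arange(8, dtype=float).reshape(8, 1)
# Stand-in for queries of draft-generated tokens; they align with keys 2 and 6.
queries = 5.0 * keys[[2, 6]]

kept_keys, kept_values, keep = select_kv(keys, values, queries, budget=2)
print(keep)  # indices of the retained KV pairs
```

In this toy setup the two lookahead queries point at prompt positions 2 and 6, so those are the KV pairs that survive a budget of 2. A real system would obtain the queries from a model run over draft-generated tokens and would apply the selection per layer and per head.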
