ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

TL;DR

ParisKV addresses the dual challenges of drift-robustness and latency in KV-cache retrieval for long-context LLMs. It introduces a GPU-native, drift-insensitive two-stage retrieval pipeline that normalizes and rotates queries and keys onto a unit hypersphere, uses data-independent centroids for collision-based pruning, and then reranks with calibrated 4-bit quantized codes, fetching the final KV entries on demand via UVA. The approach often matches or exceeds full-attention quality and delivers substantial throughput and latency improvements: up to $2.8\times$ higher decoding throughput within full attention's runnable range, and up to $17\times$ and $44\times$ lower decode latency than the state-of-the-art baselines MagicPIG and PQCache, respectively, at million-token scale. This work enables scalable long-context inference that maintains accuracy while reducing GPU memory pressure and data movement, facilitating practical deployment of very-long-context LLMs.
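The first stage described above (normalize, rotate, then prune by centroid collisions) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the codebook here is a set of random unit directions per subspace, chosen only because it is independent of the key data, and the names `make_codebook`, `collision_votes`, `n_probe`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def haar_orthogonal(dim, rng):
    """Sample a Haar-distributed orthogonal matrix (QR with sign correction)."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs so the law is Haar

def normalize_rotate(x, rot):
    """Project rows onto the unit sphere, then apply the shared rotation."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    return x @ rot.T

def make_codebook(B, C, m, rng):
    """Assumed stand-in codebook: C random unit directions per subspace,
    generated without looking at any key data."""
    cb = rng.standard_normal((B, C, m))
    return cb / np.linalg.norm(cb, axis=-1, keepdims=True)

def assign_centroids(x_rot, codebook):
    """Nearest centroid (by inner product) in each of the B subspaces."""
    B, C, m = codebook.shape
    sub = x_rot.reshape(-1, B, m)                       # (N, B, m)
    scores = np.einsum("nbm,bcm->nbc", sub, codebook)   # (N, B, C)
    return scores.argmax(axis=-1)                       # (N, B)

def collision_votes(query_rot, key_ids, codebook, n_probe):
    """Per key, count the subspaces where its centroid ID is among the
    query's n_probe best-matching centroids."""
    B, C, m = codebook.shape
    q_scores = np.einsum("bm,bcm->bc", query_rot.reshape(B, m), codebook)
    probe = np.argsort(-q_scores, axis=1)[:, :n_probe]  # (B, n_probe)
    hit = np.zeros((B, C), dtype=bool)
    hit[np.arange(B)[:, None], probe] = True
    return hit[np.arange(B)[None, :], key_ids].sum(axis=1)  # (N,)

rng = np.random.default_rng(0)
D, B, C = 64, 8, 16            # illustrative sizes, not taken from the paper
m = D // B
rot = haar_orthogonal(D, rng)
codebook = make_codebook(B, C, m, rng)
keys_rot = normalize_rotate(rng.standard_normal((1000, D)), rot)
key_ids = assign_centroids(keys_rot, codebook)          # GPU-resident summary
query_rot = keys_rot[7]                                 # query identical to key 7
votes = collision_votes(query_rot, key_ids, codebook, n_probe=2)
candidates = np.argsort(-votes, kind="stable")[:32]     # coarse candidate set
```

Because the codebook and the rotation are fixed before any keys are seen, keys generated later during decoding are assigned with the same `assign_centroids` call and no re-clustering, which is the property the summary calls drift-insensitive.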

Abstract

KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache Top-$k$ retrieval baselines.
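As a rough illustration of reranking candidates with a quantized inner-product estimate, the sketch below uses plain per-key uniform 4-bit quantization with a scale and an offset. The abstract does not specify the estimator or its calibration, so this scheme and the names `quantize_4bit`, `estimate_scores`, and `rerank_topk` are assumptions rather than the paper's method.

```python
import numpy as np

def quantize_4bit(keys):
    """Per-key uniform quantization: keys ~= lo + scale * codes, codes in [0, 15]."""
    lo = keys.min(axis=1, keepdims=True)
    hi = keys.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / 15.0
    codes = np.clip(np.rint((keys - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale[:, 0], lo[:, 0]

def estimate_scores(query, codes, scale, lo):
    """Estimate <query, key> from the codes alone:
    scale * <query, codes> + lo * sum(query)."""
    return scale * (codes.astype(np.float64) @ query) + lo * query.sum()

def rerank_topk(query, cand, codes, scale, lo, k):
    """Order the candidate indices by estimated score and keep the best k."""
    est = estimate_scores(query, codes[cand], scale[cand], lo[cand])
    return cand[np.argsort(-est)[:k]]

rng = np.random.default_rng(0)
D, N = 64, 512                      # illustrative sizes
query = rng.standard_normal(D)
keys = rng.standard_normal((N, D))
keys[42] = 2.0 * query              # plant one key that clearly matches the query
codes, scale, lo = quantize_4bit(keys)
cand = np.arange(N)                 # in a real pipeline: the coarse candidate set
est = estimate_scores(query, codes, scale, lo)
top = rerank_topk(query, cand, codes, scale, lo, k=8)
```

With rounding to the nearest code, each coordinate is off by at most `scale / 2`, so the estimate differs from the exact inner product by at most `0.5 * scale * sum(|query|)` per key. Only the codes, `scale`, and `lo` are needed for reranking, so the full-precision keys can stay off the GPU until the final top-$k$ is known.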

Paper Structure (24 sections, 1 theorem, 11 equations, 10 figures, 3 tables)

This paper contains 24 sections, 1 theorem, 11 equations, 10 figures, 3 tables.

Key Result

Proposition 4.1

Let $\hat{\mathbf{k}}\in \mathbb{S}^{D-1}$ be any unit vector and $\mathbf{R}\in\mathbb{R}^{D\times D}$ be Haar-random orthogonal. Let $\tilde{\mathbf{k}}=\mathbf{R}\hat{\mathbf{k}}$ and partition it into $B$ contiguous subspaces of dimension $m$ ($D=Bm$): $\tilde{\mathbf{k}}=[\tilde{\mathbf{k}}_{1},\dots]$ …
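The statement is cut off here, so its exact conclusion is not reproduced. A standard fact that fits the setup is that a Haar-rotated unit vector is uniform on the sphere, and the squared norm of any $m$ of its $D$ coordinates follows $\mathrm{Beta}(m/2,\,(D-m)/2)$. Whether Proposition 4.1 is stated in exactly this form is an assumption. The sketch below checks the first two moments empirically; `subspace_energies` and `beta_mean_var` are illustrative names.

```python
import numpy as np

def haar_orthogonal(dim, rng):
    """Sample a Haar-distributed orthogonal matrix (QR with sign correction)."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def subspace_energies(k_hat, B, trials, rng):
    """Squared norm of each of the B contiguous blocks of R @ k_hat,
    collected over independent Haar rotations R."""
    D = k_hat.shape[0]
    m = D // B
    out = np.empty((trials, B))
    for t in range(trials):
        k_rot = haar_orthogonal(D, rng) @ k_hat
        out[t] = (k_rot.reshape(B, m) ** 2).sum(axis=1)
    return out

def beta_mean_var(m, D):
    """Mean and variance of Beta(m/2, (D-m)/2)."""
    a, b = m / 2.0, (D - m) / 2.0
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1.0))

rng = np.random.default_rng(0)
D, B = 32, 4                        # illustrative sizes
m = D // B
k_hat = np.zeros(D)
k_hat[0] = 1.0                      # all energy in the first block before rotation
energies = subspace_energies(k_hat, B, trials=2000, rng=rng)
mean_theory, var_theory = beta_mean_var(m, D)
mean_emp = energies.mean(axis=0)    # one value per subspace
var_emp = energies.var(axis=0)
```

Even though `k_hat` starts with all of its energy in one block, after rotation every block has the same energy distribution, with mean $m/D$. This is why a codebook can be designed per subspace without looking at the keys.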

Figures (10)

  • Figure 1: Retrieval drift results. (a) Recall comparison of different methods on AIME. (b) Centroid drift induced by decoding keys, measured as the mismatch between prefill-only centroids (original centroids in blue) and reference centroids (correct centroids in red) obtained by clustering all keys from both prefill and decoding.
  • Figure 2: ParisKV pipeline. Offline, we construct an analytic centroid codebook and a quantization configuration. During prefill, we materialize the KV cache and build GPU-resident key summaries (centroid IDs for Stage-I vote-based filtering, and low-bit codes with lightweight weights for Stage-II reranking), while asynchronously offloading full-precision KV to CPU memory. During decoding, summaries for newly generated keys are incrementally updated; the GPU performs coarse-to-fine retrieval (voting $\rightarrow$ reranking) using only these summaries, then fetches the selected Top-$k$ KV pairs from CPU via UVA for attention.
  • Figure 3: Illustration of rotation-based codebook assignment on the unit sphere.
  • Figure 4: ParisKV Retrieval algorithm (shown as B.2 in Fig. 2).
  • Figure 5: Sliding-window KV-cache update.
  • ...and 5 more figures
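The decoding path in the Figure 2 caption (full-precision KV kept off the GPU, incremental updates for new keys, and a Top-$k$ fetch before attention) can be mimicked with a small NumPy stand-in. UVA and the GPU are not modeled: `OffloadedKV` is a hypothetical container whose `fetch` merely indexes a host array, and the Top-$k$ indices come from exact scores instead of the two-stage retrieval.

```python
import numpy as np

class OffloadedKV:
    """Hypothetical host-side store for full-precision keys and values."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def append(self, k, v):
        # Called once per newly decoded token; a real system would write into
        # preallocated pinned memory instead of re-stacking arrays.
        self.keys = np.vstack([self.keys, k[None]])
        self.values = np.vstack([self.values, v[None]])

    def fetch(self, idx):
        # Stand-in for the on-demand read of only the selected rows.
        return self.keys[idx], self.values[idx]

def topk_attention(query, store, idx):
    """Softmax attention restricted to the fetched KV pairs."""
    k_sel, v_sel = store.fetch(idx)
    logits = k_sel @ query / np.sqrt(query.shape[0])
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    w /= w.sum()
    return w @ v_sel

rng = np.random.default_rng(0)
dim = 32
store = OffloadedKV(dim)
for _ in range(256):                    # "prefill": materialize the cache
    store.append(rng.standard_normal(dim), rng.standard_normal(dim))
for _ in range(4):                      # "decode": keys arrive one at a time
    store.append(rng.standard_normal(dim), rng.standard_normal(dim))

query = rng.standard_normal(dim)
scores = store.keys @ query             # exact scores stand in for retrieval
topk_idx = np.argsort(-scores)[:16]
out_topk = topk_attention(query, store, topk_idx)
out_all = topk_attention(query, store, np.arange(len(store.keys)))
```

Selecting every index reproduces dense attention, so the sparse path is a restriction of the same computation to the retrieved rows. The retrieval quality alone determines how close `out_topk` is to `out_all`.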

Theorems & Definitions (1)

  • Proposition 4.1: Rotation-induced Beta priors for subspaces