Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot

TL;DR

Q-Filters is a training-free KV Cache compression method that exploits the geometry of Query and Key representations to prune less informative KV pairs without accessing attention weights. It derives a projection direction from the dominant singular component of Query representations computed on calibration data; projecting Keys onto this direction estimates their relevance and enables KV eviction that is compatible with FlashAttention. Empirical results across language modeling, needle-in-a-haystack retrieval, and long-context tasks show competitive performance at up to 32× compression and robustness to the choice of calibration data, including 99% needle-in-a-haystack accuracy and a generation perplexity drop up to 65% smaller than with Streaming-LLM. The approach addresses the memory bottleneck of long-context generation without any retraining and remains compatible with modern memory-efficient attention mechanisms.

Abstract

Autoregressive language models rely on a Key-Value (KV) Cache, which speeds up generation by avoiding the re-computation of past hidden states. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Unlike many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves 99% accuracy in the needle-in-a-haystack task at a ×32 compression level while reducing the generation perplexity drop by up to 65% compared to Streaming-LLM.
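
To make the mechanism concrete, the following is a minimal NumPy sketch for a single attention head; it is not the authors' implementation. It assumes the filter direction is the first right singular vector of a matrix of calibration Query vectors, oriented so that Queries project positively on average, and that the Keys with the largest projection onto that direction are the ones worth keeping. The function names `fit_q_filter` and `compress_kv`, the sign convention, and the top-k eviction rule are illustrative assumptions.

```python
import numpy as np


def fit_q_filter(calib_queries):
    """Estimate a unit filter direction for one head.

    calib_queries: array of shape (n_tokens, head_dim) holding Query vectors
    collected on calibration text (assumed to be gathered beforehand).
    """
    # Rows of vt are the right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(calib_queries, full_matrices=False)
    u = vt[0]
    # The sign of a singular vector is arbitrary; orient u so that the
    # calibration Queries have a positive mean projection (assumed convention).
    if np.mean(calib_queries @ u) < 0:
        u = -u
    return u


def compress_kv(keys, values, u, keep):
    """Keep the `keep` KV pairs whose Keys project most strongly onto u.

    keys: (seq_len, head_dim), values: (seq_len, value_dim).
    Returns the pruned keys, pruned values, and the kept token indices.
    """
    # One dot product per cached Key; no attention weights are computed.
    scores = keys @ u
    # Indices of the top-`keep` scores, re-sorted to preserve token order.
    kept = np.sort(np.argsort(scores)[-keep:])
    return keys[kept], values[kept], kept
```

Because the score depends only on the cached Keys and a fixed direction, nothing in this sketch needs the attention map, which is the property that makes this style of scoring usable alongside FlashAttention.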

Paper Structure

This paper contains 24 sections, 1 theorem, 11 equations, 14 figures, 2 tables.

Key Result

Theorem 3.3

Under Assumptions 1 and 2, we have: where $\kappa^h$ is a positive constant.

Figures (14)

  • Figure 1: Accuracy vs Time to First Token (TTFT) tradeoff for Llama-3.1-70B-Instruct, measured on the Ruler dataset with $\times$8 compression. The TTFT is measured using 2 A100 GPUs on 8192-token sequences.
  • Figure 2: Left and center: distributions of the projections of $Q^h$ and $K^h$ on $u^h$ for Llama-3.1-8B. Right: estimates of $\left|\mathbb{E}_{i}(\langle Q^h_i, v_m \rangle)\right|$ where $v_m$ are the right singular vectors from the SVD of a set of $Q^h$ representations from different Llama models, averaged over all layers and heads.
  • Figure 3: Projection of $Q^h$ and $K^h$ vectors in the first two components of the SVD of $Q^h$ for different heads in Llama-3.2-1B. The colour on the $K$ projections represents the $\log$-average attention at the corresponding index for the current head. The $x$-axis and $y$-axis indicate the results of a projection of the representations on $v_1$ and $v_2$, respectively.
  • Figure 4: Spearman rank correlation between KV compression scoring metrics and the observed attention $S^h$ for Llama-3.2-1B, for K-norm (top) and Q-Filters (bottom).
  • Figure 5: Generation performance for a KV Cache size limited to 512 items for Llama-3.1-8B (top) and Llama-3.1-70B (bottom).
  • ...and 9 more figures

Theorems & Definitions (1)

  • Theorem 3.3: proof in the appendix