KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau
TL;DR
KQ-SVD introduces a provably optimal, interaction-aware low-rank compression of the KV cache for transformer attention. It directly optimizes the query-key interaction through the closed-form factorization A = K^+ Û, B = K^T Û, where K^+ denotes the pseudoinverse of K, and so preserves attention fidelity better than key-only or concatenated query-key methods. The approach extends to Grouped-Query Attention and is supported by a theoretical analysis of the gap to these baselines and by empirical gains across multiple LLMs on the C4 dataset. Because it requires no retraining, it enables memory-efficient inference and makes long-context generation more practical. Overall, KQ-SVD advances KV-cache compression by aligning the low-rank approximation with the inner-product structure of attention.
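To make the closed-form factorization concrete, here is a minimal NumPy sketch. It rests on assumptions the summary does not spell out: Û is taken to hold the top-R left singular vectors of the interaction matrix K Q^T, K^+ is taken to be the Moore-Penrose pseudoinverse, and K and Q are (tokens x head dimension) matrices. The function name kq_svd is hypothetical, and the full SVD of the product is used only for illustration; it is not meant to reflect how the authors compute the factors efficiently.

```python
import numpy as np


def kq_svd(K, Q, rank):
    """Sketch of A = K^+ U_hat, B = K^T U_hat for a target rank.

    Assumes K and Q have shape (tokens, head_dim) and that U_hat holds
    the top-`rank` left singular vectors of K @ Q.T (an assumption).
    """
    # Left singular vectors of the (tokens x tokens) interaction matrix.
    U, _, _ = np.linalg.svd(K @ Q.T, full_matrices=False)
    U_hat = U[:, :rank]

    A = np.linalg.pinv(K) @ U_hat  # (head_dim, rank): projects keys
    B = K.T @ U_hat                # (head_dim, rank): projects queries
    return A, B


def truncated_product(K, Q, rank):
    """Best rank-`rank` approximation of K @ Q.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(K @ Q.T, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```

Under these assumptions, one would cache the compressed keys K @ A (tokens x rank) in place of K and project each query with B. As long as the rank does not exceed the rank of K Q^T, the product (K @ A) @ (Q @ B).T matches the truncated SVD of K Q^T, which is the best approximation of that rank in Frobenius norm.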
Abstract
The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
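The suboptimality claim can be illustrated numerically with a small sketch. The assumptions here are ours: "key-only" compression is modeled as projecting both keys and queries onto the top right singular vectors of K, and the interaction-aware target is modeled as the truncated SVD of K Q^T. All function names are hypothetical, and the toy data is synthetic rather than taken from any model.

```python
import numpy as np


def key_only_scores(K, Q, rank):
    """Approximate K @ Q.T after projecting onto the top right singular vectors of K."""
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    P = Vt[:rank].T  # (head_dim, rank) basis chosen from keys alone
    return (K @ P) @ (Q @ P).T


def interaction_scores(K, Q, rank):
    """Best rank-`rank` approximation of K @ Q.T (truncated SVD of the product)."""
    U, s, Vt = np.linalg.svd(K @ Q.T, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def frobenius_errors(K, Q, rank):
    """Return (key-only error, interaction-aware error) in Frobenius norm."""
    S = K @ Q.T
    err_key = np.linalg.norm(S - key_only_scores(K, Q, rank))
    err_int = np.linalg.norm(S - interaction_scores(K, Q, rank))
    return err_key, err_int


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scales = np.linspace(1.0, 0.1, 16)
    # Toy setup: queries are large exactly where keys are small.
    K = rng.standard_normal((64, 16)) * scales
    Q = rng.standard_normal((64, 16)) / scales
    print(frobenius_errors(K, Q, rank=4))
```

Because the key-only approximation is itself a matrix of rank at most R, the Eckart-Young theorem guarantees that its Frobenius error cannot be smaller than that of the truncated SVD of K Q^T. The toy data is built so that the directions that matter to the queries carry little key variance, which is the situation in which a basis chosen from keys alone would be expected to do worst.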
