Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Hengshuai Yao; Guan Wang

TL;DR

The paper argues that selection is an inherently lower-dimensional operation than value transfer, requiring only $\mathcal{O}(\log N)$ dimensions to distinguish among $N$ relevant patterns.
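To make the $\log N$ claim concrete, here is a minimal sketch (not the paper's construction or proof) showing that $\lceil \log_2 N \rceil$ dimensions are enough for a dot product to pick one pattern out of $N$ under hard (argmax) selection. The helper names `binary_codes` and `select` are hypothetical, and the $\pm 1$ binary-code keys are an assumption chosen for illustration; learned keys need not look like this.

```python
import math
import numpy as np

def binary_codes(n_patterns: int) -> np.ndarray:
    """Return an (n_patterns, d) matrix of distinct +/-1 codes, d = ceil(log2 n_patterns)."""
    d = max(1, math.ceil(math.log2(n_patterns)))
    idx = np.arange(n_patterns)
    # Bit j of each index becomes coordinate j of that pattern's key.
    bits = (idx[:, None] >> np.arange(d)[None, :]) & 1
    return 2.0 * bits - 1.0

def select(query: np.ndarray, keys: np.ndarray) -> int:
    """Hard selection: index of the key with the largest dot product."""
    return int(np.argmax(keys @ query))

N = 1024
keys = binary_codes(N)
d_select = keys.shape[1]

# A key dotted with itself scores d_select; any other key differs in at
# least one coordinate and scores at most d_select - 2, so the match is unique.
hits = sum(select(keys[i], keys) == i for i in range(N))
print(d_select, hits)
```

The margin between the best and second-best score here is a constant 2, so a softmax over these scores would need a temperature or scale factor to approach hard selection; the sketch only addresses whether the patterns are distinguishable at all.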

Abstract

Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = d_{\text{model}}$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (\emph{selection}), while values carry rich semantic representations (\emph{value transfer}). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $\mathcal{O}(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1)~positional selection tasks requiring just 1~dimension per head, (2)~content-based retrieval requiring $\sim\!\log_2 N$ dimensions, (3--4)~WikiText-2 and WikiText-103 language modeling where $d_{\text{select}} = d_{\text{model}}/4$ incurs only 4.3\% perplexity increase while reducing QK parameters by 75\%, (5)~post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6)~a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7)~Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75\% key cache savings at just 2.0\% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75\% key cache savings at $<$2\% residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25\,GB of KV cache per user, enabling approximately 60\% more concurrent users on the same GPU.
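The following sketch illustrates the asymmetric layout described in the abstract and the cache arithmetic behind the headline numbers. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the attention is single-head NumPy code, and the model configuration (32 layers, $d_{\text{model}} = 4096$, full multi-head attention without grouped-query sharing, fp16 cache, 128,000 tokens) is an assumed 7B-class setup rather than one given in the text above.

```python
import numpy as np

def asymmetric_attention(x, w_q, w_k, w_v):
    """Single-head causal attention with thin Q/K and full-width V."""
    q = x @ w_q                       # (T, d_select)
    k = x @ w_k                       # (T, d_select), thin keys go in the cache
    v = x @ w_v                       # (T, d_model), full values go in the cache
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, k, v

def kv_cache_bytes(n_layers, d_key, d_value, n_tokens, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    return n_layers * n_tokens * (d_key + d_value) * bytes_per_elem

# Tiny forward pass with d_select = d_model / 4.
rng = np.random.default_rng(0)
T, d_model, d_select = 6, 16, 4
x = rng.standard_normal((T, d_model))
out, k_cache, v_cache = asymmetric_attention(
    x,
    rng.standard_normal((d_model, d_select)),
    rng.standard_normal((d_model, d_select)),
    rng.standard_normal((d_model, d_model)),
)

# Cache arithmetic for the ASSUMED configuration (not taken from the paper).
layers, width, tokens = 32, 4096, 128_000
full = kv_cache_bytes(layers, width, width, tokens)
thin = kv_cache_bytes(layers, width // 4, width, tokens)
saved = full - thin
print(f"saved {saved / 1e9:.1f} GB, capacity ratio {full / thin:.2f}x")
```

Under these assumptions the key cache shrinks by 75% and the total KV cache drops from about 67 GB to about 42 GB, a saving of roughly 25 GB and a 1.6x ratio. The ratio translates into additional concurrent users only to the extent that KV cache, rather than weights or activations, is the binding memory constraint.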

Paper Structure (54 sections, 10 equations, 11 tables)

Introduction
Method
Asymmetric Attention
Theoretical Motivation
Selection as a ranking problem.
Dimensionality for ranking.
What determines $N$?
Value transfer requires full dimensionality.
Post-Training Compression via Factored Keys
KV Cache Implications
Experiments
Experiment 1: Positional Selection (Copy-Back Task)
Setup.
Results.
Experiment 2: Content-Based Selection (Key-Value Retrieval)
...and 39 more sections
