EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection

Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang

TL;DR

EliteKV compresses the KV cache of RoPE-based transformers by decoupling the nonlinear effect of RoPE from cache compression. It introduces RoPElite to identify per-head frequency preferences and applies Joint Low-Rank Decomposition (J-LRD) to factorize the K and V projections into a shared low-rank space, enabling configurable KV-cache reductions. With uptraining on less than $0.6\%$ of the original training data, EliteKV reduces the KV cache to as low as $25\%$ of its original size while preserving performance, and at $12.5\%$ of the original cache size it matches a strong baseline (GQA) that uses $50\%$. The results hold across model sizes in the LLaMA2 family, offering a practical path to faster inference and lower memory use for RoPE-based foundation models.

Abstract

Rotary Position Embedding (RoPE) enables each attention head to capture multi-frequency information along the sequence dimension and is widely applied in foundation models. However, the nonlinearity introduced by RoPE complicates optimization of the key state in the Key-Value (KV) cache for RoPE-based attention. Existing KV cache compression methods typically store the key state before rotation and apply the transformation during decoding, introducing additional computational overhead. This paper introduces EliteKV, a flexible modification framework for RoPE-based models that supports variable KV cache compression ratios. EliteKV first identifies the intrinsic frequency preference of each head using RoPElite, selectively restoring linearity to certain dimensions of the key within the attention computation. Building on this, joint low-rank compression of key and value enables partial cache sharing. Experimental results show that with minimal uptraining on only $0.6\%$ of the original training data, RoPE-based models achieve a $75\%$ reduction in KV cache size while preserving performance within a negligible margin. Furthermore, EliteKV consistently performs well across models of different scales within the same family.
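
The idea of joint low-rank compression of key and value can be illustrated with a minimal NumPy sketch. It assumes the joint factorization is a truncated SVD of the concatenated projection matrices, which this summary does not spell out; the function name `joint_low_rank`, the matrix sizes, and the random weights are all hypothetical, and the RoPE-rotated part of the key is ignored here.

```python
import numpy as np

def joint_low_rank(w_k, w_v, rank):
    """Approximate [w_k | w_v] by down @ [up_k | up_v] with one shared down-projection.

    Assumption: a truncated SVD of the concatenated matrix stands in for
    whatever factorization the paper actually uses.
    """
    d_k = w_k.shape[1]
    joint = np.concatenate([w_k, w_v], axis=1)        # (d_model, d_k + d_v)
    u, s, vt = np.linalg.svd(joint, full_matrices=False)
    down = u[:, :rank] * s[:rank]                     # shared down-projection (d_model, rank)
    up = vt[:rank]                                    # stacked up-projections (rank, d_k + d_v)
    return down, up[:, :d_k], up[:, d_k:]

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
d_model, d_k, d_v, rank = 64, 32, 32, 16
w_k = rng.standard_normal((d_model, d_k))
w_v = rng.standard_normal((d_model, d_v))
down, up_k, up_v = joint_low_rank(w_k, w_v, rank)

x = rng.standard_normal((5, d_model))   # hidden states of 5 tokens
latent = x @ down                        # shared cache entry per token: (5, rank)
k_approx = latent @ up_k                 # keys recovered from the shared latent
v_approx = latent @ up_v                 # values recovered from the same latent
```

In this toy setup the cache holds 16 values per token instead of the 64 needed for separate K and V states, which is how a shared latent lowers cache size. Random matrices are not low-rank, so the approximation error here says nothing about real model weights.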

Paper Structure

This paper contains 21 sections, 17 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: The attention computation flow after applying EliteKV. The upper part illustrates RoPElite, where each attention head focuses only on its most important frequency along the sequence dimension. The lower part shows the joint low-rank projection, where the K and V states are represented by a shared cache. The different colored fillings in the elements represent 2D chunks attending to different frequencies.
  • Figure 2: Top-$8$ chunks of different attention heads in different layers. Frequency preference patterns of different attention heads across layers in the LLaMA2-7B model. Numbers increase from high to low frequencies.
  • Figure 3: Performance of top-$r$ chunks. Uptraining proportion represents the proportion of tokens relative to the total number of tokens used during training the original model.
  • Figure 4: The projection matrices processed using S-LRD and J-LRD (Left), and the roles of the resulting matrices in the attention computation flow (Right).
  • Figure 5: The perplexity of the RoPElite model on the dataset with varying compression ratios of the KV cache, as the number of frequency-related chunks retained for each attention head changes under S-LRD and J-LRD. Only points without additional parameter overhead are shown.
  • ...and 3 more figures
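
The "2D chunks attending to different frequencies" in the captions above refer to the pairs of dimensions that RoPE rotates together. The sketch below uses the standard RoPE frequency schedule and rotates only a chosen subset of chunks, leaving the rest position-independent. It is an illustration under stated assumptions, not the paper's RoPElite procedure: the adjacent-pair layout, the helper names `rope_frequencies` and `rotate_chunks`, and the hand-picked `keep` list are all hypothetical, and how EliteKV selects the chunks for each head is not shown.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Chunk i rotates by position * base**(-2i / head_dim);
    # chunk 0 has the highest frequency, later chunks have lower ones.
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def rotate_chunks(x, position, freqs, keep):
    """Apply the RoPE rotation only to the 2D chunks listed in `keep`.

    Assumption: chunk c occupies dimensions (2c, 2c + 1). Chunks not in
    `keep` are returned unchanged, so they stay linear in the input.
    """
    out = x.astype(float)
    for c in keep:
        angle = position * freqs[c]
        cos, sin = np.cos(angle), np.sin(angle)
        a, b = x[2 * c], x[2 * c + 1]
        out[2 * c] = a * cos - b * sin
        out[2 * c + 1] = a * sin + b * cos
    return out

freqs = rope_frequencies(8)
keep = [0, 2]                         # hypothetical chunk choice for one head
q = np.arange(8.0)
k = np.arange(8.0)[::-1].copy()

# The attention score depends only on the relative offset (here 2),
# even though only some chunks are rotated.
score_a = rotate_chunks(q, 3, freqs, keep) @ rotate_chunks(k, 1, freqs, keep)
score_b = rotate_chunks(q, 10, freqs, keep) @ rotate_chunks(k, 8, freqs, keep)
```

Because the unrotated chunks contribute a plain dot product, that part of the key no longer depends on position and can be handled with ordinary linear algebra such as a low-rank projection.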