RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan

TL;DR

RotateKV addresses the KV cache memory bottleneck in LLM inference with an outlier-aware rotation framework for 2-bit KV quantization. It combines Outlier-Aware Rotation, Pre-RoPE Grouped-Head Rotation, and Attention-Sink-Aware Quantization to reach high compression with minimal accuracy loss: less than 0.3 perplexity degradation on WikiText-2 with LLaMA-2-13B, a 3.97x reduction in peak memory usage, support for 5.75x larger batch sizes, and a 2.32x speedup in the decoding stage. CoT reasoning and long-context capabilities are preserved, with less than 1.7% degradation on GSM8K. This work advances rotation-based quantization for KV caches and provides practical, scalable techniques for memory-efficient LLM inference.

Abstract

The Key-Value (KV) cache facilitates efficient large language model (LLM) inference by avoiding recomputation of past KVs. As batch size and context length increase, oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which uses channel reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on the proposed outlier-aware rotation and further smooths outliers across heads; (iii) Attention-Sink-Aware Quantization, which leverages massive activations to precisely identify and protect attention sinks. RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B and maintains strong CoT reasoning and long-context capabilities, with less than 1.7% degradation on GSM8K, outperforming existing methods even at lower average bit-widths. RotateKV also delivers a 3.97x reduction in peak memory usage, supports 5.75x larger batch sizes, and achieves a 2.32x speedup in the decoding stage.
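
To make the rotation idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It applies a plain orthonormal FWHT to one synthetic key vector with a single outlier channel and compares 2-bit quantization error with and without the rotation. The channel reordering, Pre-RoPE grouped-head rotation, and sink protection described above are not modeled. The vector length (128), the outlier value (20.0), the per-vector asymmetric quantizer, and all function names are assumptions chosen for illustration.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]


def quantize_2bit(vec):
    """Asymmetric uniform quantize-dequantize to 4 levels over the whole vector."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / 3
    return [lo + round((x - lo) / step) * step for x in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic key vector (assumption): unit-variance noise plus one outlier channel.
rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]
key[5] = 20.0

rotated = fwht(key)

# Quantize directly, or quantize in the rotated domain and rotate back.
direct_err = mse(key, quantize_2bit(key))
rotated_err = mse(key, fwht(quantize_2bit(rotated)))

print("peak magnitude before/after rotation:",
      max(abs(x) for x in key), max(abs(x) for x in rotated))
print("2-bit MSE direct vs rotated:", direct_err, rotated_err)
```

Because the transform is orthonormal, the vector norm is unchanged and the outlier's energy is spread across all channels, which narrows the range the four quantization levels have to cover. How much this helps on real keys depends on the outlier distribution, which is what the paper's outlier-aware reordering targets.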

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: The distribution of Keys in Layer 10 of LLaMA-2-7B. The proposed outlier-aware adaptive rotation demonstrates outstanding capability in reducing outliers.
  • Figure 2: Magnitude of LLaMA-2-7B Keys. A small number of channels exhibit disproportionately large magnitudes, and these outlier channels vary across different attention heads.
  • Figure 3: Overview of RotateKV. On the left is the outlier-aware rotation combined with the pre-RoPE pipeline. On the right, we demonstrate attention-sink-aware quantization. Since attention is concentrated on the massive activations, we can identify attention sinks in the current attention layer by utilizing the token indices of these massive activations from the output of the previous decoder block.
  • Figure 4: Existing rotation paradigm.
  • Figure 5: Visualizations of Decoder Block 10 outputs and the attention scores in Attention Layer 11 from LLaMA-2-7B using input from the WikiText-2 dataset. As shown in Figure 5(a), massive activations occur at tokens 0 and 110, in channels 1415 and 2533. In the subsequent attention layer, attention is focused on tokens 1 and 110 across all heads, as illustrated in Figure 5(b).
  • ...and 3 more figures
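
The attention-sink-aware quantization described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: the detection rule (a token counts as a sink when its peak activation exceeds `ratio` times the median peak), the value `ratio=100.0`, the function names, and the synthetic data are all hypothetical. The only idea taken from the captions is that tokens carrying massive activations in the previous decoder block's output are treated as attention sinks and kept at full precision while the remaining tokens are quantized to 2 bits.

```python
from statistics import median


def find_sink_tokens(hidden, ratio=100.0):
    """Return token indices whose peak |activation| is far above the median peak.

    `hidden` is a tokens x channels list of lists. The ratio rule is a
    hypothetical stand-in for a massive-activation detector.
    """
    peaks = [max(abs(v) for v in row) for row in hidden]
    typical = median(peaks)
    return [t for t, p in enumerate(peaks) if p > ratio * typical]


def quantize_2bit(vec):
    """Asymmetric uniform quantize-dequantize to 4 levels over one token's vector."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / 3
    return [lo + round((x - lo) / step) * step for x in vec]


def quantize_keys(keys, sink_tokens):
    """Quantize every token's key to 2 bits except the protected sink tokens."""
    protected = set(sink_tokens)
    return [list(row) if t in protected else quantize_2bit(row)
            for t, row in enumerate(keys)]


# Toy output of the previous decoder block: tokens 0 and 3 carry massive activations.
hidden = [
    [900.0, 0.3, -0.5, 0.1],
    [0.4, -1.2, 0.7, 0.2],
    [0.8, 0.1, -0.3, 0.5],
    [0.2, -750.0, 0.6, -0.1],
    [-1.0, 0.4, 0.3, 0.9],
    [0.5, -0.9, 0.2, 0.1],
]

# Toy keys for the current attention layer: 6 tokens x 8 channels.
keys = [[((t * 8 + c) % 7 - 3) * 0.25 for c in range(8)] for t in range(6)]

sinks = find_sink_tokens(hidden)
quantized = quantize_keys(keys, sinks)
print("sink tokens:", sinks)
```

Detecting sinks from the previous block's output means the protected token indices are known before the current layer's keys are quantized, which is the ordering the Figure 3 caption describes.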