Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy

TL;DR

The KV cache becomes a memory bottleneck for long-context LLM inference. Eigen Attention performs attention in a low-rank space: basis vectors are derived offline via SVD on calibration data and folded into the weight matrices, which compresses the KV cache and speeds up the attention operation. The approach is orthogonal to existing KV cache compression methods and achieves up to 40% KV cache reduction and up to 60% lower attention latency across the OPT, MPT, and Llama families, with a modest drop in accuracy. RoPE compatibility and layer-wise rank allotment make it more practical for long-sequence generation.

Abstract

Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance. Code is available at https://github.com/UtkarshSaxena1/EigenAttn.
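
The low-rank idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a single attention head, no RoPE, synthetic activations built to be approximately low-rank, and a basis shared by queries and keys. The names `low_rank_basis`, `W_q_low`, and `W_k_low` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def low_rank_basis(reps, r):
    """Return the top-r right singular vectors (d x r) of an (n x d) matrix."""
    _, _, vt = np.linalg.svd(reps, full_matrices=False)
    return vt[:r].T

rng = np.random.default_rng(0)
n, d, r = 256, 64, 16  # tokens, head dimension, reduced rank (r << d)

# Hypothetical calibration activations with approximately low-rank structure
# (latent dimension 8, plus a small amount of noise).
latent = rng.standard_normal((n, 8))
mix = rng.standard_normal((8, d))
X = latent @ mix + 0.01 * rng.standard_normal((n, d))

W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)

# Offline step: derive one orthonormal basis from stacked queries and keys.
Q, K = X @ W_q, X @ W_k
U = low_rank_basis(np.vstack([Q, K]), r)  # shape (d, r)

# Fold the basis into the projection weights so no extra matmul is needed
# at inference time.
W_q_low = W_q @ U  # shape (d, r)
W_k_low = W_k @ U  # shape (d, r)

# Inference: queries and keys live in r dimensions; only K_low is cached.
Q_low, K_low = X @ W_q_low, X @ W_k_low

scores_full = Q @ K.T
scores_low = Q_low @ K_low.T
rel_err = np.linalg.norm(scores_full - scores_low) / np.linalg.norm(scores_full)

print("cached key shape:", K_low.shape, "instead of", K.shape)
print("relative error of attention scores:", rel_err)
```

In this toy setting the cached keys shrink from `d` to `r` columns per token, while the pre-softmax scores stay close to the full-rank ones because the activations were constructed to be low-rank. Real model activations are only approximately low-rank, so the error depends on the chosen rank.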

Paper Structure

This paper contains 21 sections, 11 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between (a) Standard Attention and (b) Eigen Attention. Eigen Attention utilizes lower dimensional ($r\ll d$) query, key, and value projection matrices than the standard attention operation, leading to KV cache compression and compute FLOPs benefits.
  • Figure 2: Eigenvalue spectrum analysis for the OPT-30b model. (a), (b) The Y-axis is the normalized cumulative eigenvalue after performing SVD on the key and value representation matrices, and the X-axis is the eigenvalue index, ordered from largest to smallest. (c), (d) Dimensions of the low-rank matrices at a normalized cumulative eigenvalue of 0.9.
  • Figure 3: PPL on Wikitext with different KV cache sizes in GB ($n$ = 2048) obtained via different quantization precision and group size. For Eigen Attention, we compress the KV cache to 0.6x and then apply quantization.
  • Figure 4: Ablation Study. (a) Average accuracy (Avg-Acc) on zero-shot tasks for Llama-2-7b with an increasing number of calibration samples. (b) Perplexity (PPL) vs fine-tuning steps on the C4 dataset for the MPT and Llama families of models.
  • Figure 5: Layerwise rank assignment for key and value determined by Eigen Attention for OPT-30b with 40% compressed KV cache.
  • ...and 1 more figures
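
The Figure 2 caption refers to the dimension at which the normalized cumulative eigenvalue reaches 0.9. A minimal sketch of that kind of threshold-based rank selection follows. The function name `rank_for_threshold` and the example spectrum are hypothetical, and the paper's actual layer-wise rank allotment procedure may differ from this simple rule.

```python
from itertools import accumulate

def rank_for_threshold(eigenvalues, threshold=0.9):
    """Smallest r such that the top-r eigenvalues hold >= threshold of the total."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    for r, cum in enumerate(accumulate(vals), start=1):
        if cum / total >= threshold:
            return r
    return len(vals)

# Hypothetical spectrum (not taken from the paper) that decays quickly.
spectrum = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.125]
r = rank_for_threshold(spectrum, threshold=0.9)
print(f"rank {r} of {len(spectrum)} covers at least 90% of the spectrum")
```

A faster-decaying spectrum yields a smaller rank at the same threshold, which is why different layers can end up with different ranks.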