LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen

TL;DR

LoRC compresses the KV cache through a low-rank approximation of the KV weight matrices. The approach is orthogonal to existing KV cache compression methods and plugs into existing transformer-based LLMs without model retraining.

Abstract

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly with sequence length and batch size, posing a significant bottleneck in LLM deployment. Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, which require extensive parameter tuning and are thus unsuitable for pre-trained LLMs; (2) KV cache compression at test time, primarily through token eviction policies, which often overlook inter-layer dependencies and can be task-specific. This paper introduces an orthogonal approach to KV cache compression. We propose a low-rank approximation of KV weight matrices, allowing for plug-in integration with existing transformer-based LLMs without model retraining. To effectively compress the KV cache at the weight level, we adjust for layerwise sensitivity and introduce a progressive compression strategy, which is supported by our theoretical analysis of how compression errors accumulate in deep networks. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages. Extensive experiments with LLaMA models ranging from 8B to 70B parameters across various tasks show that our approach significantly reduces the GPU memory footprint while maintaining performance.
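
The linear scaling of KV cache memory with sequence length and batch size can be illustrated with a small back-of-the-envelope calculation. This is a minimal sketch: the helper `kv_cache_bytes` and the model configuration below are hypothetical and chosen only for illustration, not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token.

    bytes_per_elem=2 assumes 16-bit storage (an assumption for this sketch).
    """
    # The leading factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

base = kv_cache_bytes(seq_len=4096, batch_size=1, **config)
doubled_seq = kv_cache_bytes(seq_len=8192, batch_size=1, **config)
doubled_batch = kv_cache_bytes(seq_len=4096, batch_size=2, **config)

print(f"base:          {base / 2**20:.0f} MiB")
print(f"2x seq length: {doubled_seq / 2**20:.0f} MiB")
print(f"2x batch size: {doubled_batch / 2**20:.0f} MiB")
```

Doubling either the sequence length or the batch size doubles the footprint, which is why reducing the cached dimension per token is attractive for long contexts and large batches.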

Paper Structure

This paper contains 30 sections, 3 theorems, 25 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $W \in \mathbb{R}^{m \times n}$ be a weight matrix (either key or value), and let $\tilde{W} \in \mathbb{R}^{m \times n}$ be its rank-$k$ approximation obtained via truncated singular value decomposition (SVD). For any input vector $x \in \mathbb{R}^n$, the error introduced by the approximation is bounded by $\|Wx - \tilde{W}x\|_2 \le \sigma_{k+1} \|x\|_2$, where $\sigma_{k+1}$ is the $(k+1)$-th singular value of $W$.
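
The setting of Theorem 1 can be checked numerically. The sketch below assumes the standard spectral-norm bound for truncated SVD, $\|Wx - \tilde{W}x\|_2 \le \sigma_{k+1} \|x\|_2$, and uses a random matrix as a stand-in for a key or value weight matrix. The helper name `truncated_svd` and the matrix sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np


def truncated_svd(W, k):
    """Return the rank-k truncated-SVD approximation of W and W's singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the k largest singular triplets.
    W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return W_k, s


rng = np.random.default_rng(0)
m, n, k = 64, 48, 16              # illustrative sizes
W = rng.standard_normal((m, n))   # random stand-in for a key/value weight matrix
W_k, s = truncated_svd(W, k)

x = rng.standard_normal(n)
error = np.linalg.norm(W @ x - W_k @ x)
bound = s[k] * np.linalg.norm(x)  # s[k] is sigma_{k+1} under 0-based indexing

print(f"error = {error:.4f}, bound = {bound:.4f}, holds = {error <= bound}")
```

The bound is tight in the worst case: the spectral norm of $W - \tilde{W}$ equals $\sigma_{k+1}$, attained when $x$ is the $(k+1)$-th right singular vector.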

Figures (4)

  • Figure 1: LoRC compresses the KV cache by decomposing the KV weight matrices in attention heads. The progressive compression strategy retains more dimensions for KV weights in shallow layers and compresses the KV weights in deep layers more aggressively.
  • Figure 2: Performance of KV cache compression on LLaMA models. LoRC compresses the KV weights with a progressive strategy, while the baselines compress each layer with the same ratio. The horizontal dashed line indicates the performance with a full-cache model.
  • Figure 3: Single-layer compression results. This experiment uses LLaMA-3-Instruct-8B on the OpenBookQA dataset.
  • Figure 4: Layerwise relative reconstruction errors. $wk_{err}$ and $wv_{err}$ denote the relative difference between the original key/value matrices and their corresponding low-rank approximations, measured using the Frobenius norm. The compression ratio is computed as $r = \frac{d_c}{N_h \times d_h}$, where $N_h$ is the number of attention heads and $d_h$ and $d_c$ are the original and compressed hidden dimensions, respectively.
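
The two quantities named in the Figure 4 caption, the relative Frobenius-norm reconstruction error and the compression ratio $r = d_c / (N_h \times d_h)$, might be computed along the following lines. This is a sketch under stated assumptions: the function names, the use of truncated SVD as the low-rank approximation, and the random matrix standing in for a real key weight matrix are all illustrative and not taken from the paper's code.

```python
import numpy as np


def relative_frobenius_error(W, rank):
    """Relative Frobenius-norm error of the rank-`rank` truncated SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_low = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return np.linalg.norm(W - W_low, "fro") / np.linalg.norm(W, "fro")


def compression_ratio(d_c, num_heads, head_dim):
    """r = d_c / (N_h * d_h), following the Figure 4 caption."""
    return d_c / (num_heads * head_dim)


rng = np.random.default_rng(0)
num_heads, head_dim, hidden = 4, 16, 64  # hypothetical sizes
# Random stand-in for a key weight matrix of shape (hidden, N_h * d_h).
W_key = rng.standard_normal((hidden, num_heads * head_dim))

for d_c in (16, 32, 64):
    r = compression_ratio(d_c, num_heads, head_dim)
    err = relative_frobenius_error(W_key, d_c)
    print(f"d_c={d_c:3d}  r={r:.2f}  relative error={err:.4f}")
```

A random Gaussian matrix has a slowly decaying spectrum, so the errors here are larger than those one would expect from trained weights; only the trend (lower error at higher $r$) carries over.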

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3