Effectively Compress KV Heads for LLM

Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu

TL;DR

The paper tackles the memory bottleneck of KV caches in long-context LLM inference by revealing a low-rank structure in KV caches and leveraging it to compress KV heads. It introduces a low-rank, SVD-based framework to convert multi-head attention (MHA) into grouped-query attention (GQA), with RoPE-specific strategies to preserve performance. Through calibration-based SVD-a initialization and LoRA fine-tuning, the approach can reduce KV heads by 50-75% while maintaining comparable accuracy and significantly boosting throughput on BLOOMZ-7B1 and LLaMA2-7B/13B. This work offers a practical, data-efficient path to more memory-efficient LLM deployment in resource-constrained settings.

Abstract

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate redundant calculations for previous tokens. Nevertheless, as context lengths and batch sizes increase, the linear expansion in memory footprint of KV caches becomes a key bottleneck of LLM deployment, which decreases generation speeds significantly. To mitigate this issue, previous techniques like multi-query attention (MQA) and grouped-query attention (GQA) have been developed, in order to reduce KV heads to accelerate inference with comparable accuracy to multi-head attention (MHA). Despite their effectiveness, existing strategies for compressing MHA often overlook the intrinsic properties of the KV caches. In this work, we explore the low-rank characteristics of the KV caches and propose a novel approach for compressing KV heads. In particular, we carefully optimize the MHA-to-GQA transformation to minimize compression error, and to remain compatible with rotary position embeddings (RoPE), we also introduce specialized strategies for key caches with RoPE. We demonstrate that our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs, which presents a promising direction for more efficient LLM deployment in resource-constrained environments.
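To make the linear growth of the KV cache concrete, here is a back-of-the-envelope sketch. The helper name `kv_cache_bytes` is hypothetical, and the shape (32 layers, 32 heads of dimension 128, fp16 storage) is an assumed LLaMA2-7B-like configuration rather than a figure quoted from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading factor of 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed LLaMA2-7B-like shape; fp16 means 2 bytes per element.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
half = kv_cache_bytes(32, 16, 128, seq_len=4096, batch_size=1)     # 50% of KV heads removed
quarter = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)   # 75% of KV heads removed

GIB = 1024 ** 3
print(full / GIB, half / GIB, quarter / GIB)  # 2.0 1.0 0.5
```

Because the footprint is a plain product, it scales linearly with sequence length and batch size, and removing half or three-quarters of the KV heads shrinks it by the same fraction.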

Paper Structure

This paper contains 18 sections, 10 equations, 2 figures, and 12 tables.

Figures (2)

  • Figure 1: Ratio of energy kept in each KV cache for LLaMA2-7B when 25% (left) and 50% (right) of the dimensions are retained. The $x$-axis is the block index.
  • Figure 2: Illustration of compressing key heads into a GQA pattern. The strategy for compressing value heads is similar.
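
The two figure captions can be illustrated with a small NumPy sketch. It is not the authors' implementation: `energy_kept` and `compress_group` are hypothetical names, the caches are synthetic low-rank data rather than real LLaMA2-7B activations, and the sketch ignores RoPE and the LoRA fine-tuning step. It assumes "energy" means the sum of squared singular values, and that one group of heads is merged by projecting their concatenated caches onto the leading right-singular vectors.

```python
import numpy as np


def energy_kept(cache, keep_ratio):
    """Fraction of squared singular-value energy retained by the top dimensions."""
    s = np.linalg.svd(cache, compute_uv=False)
    k = max(1, int(round(keep_ratio * s.size)))
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


def compress_group(caches, out_dim):
    """Merge the caches of one head group into a single out_dim-wide cache.

    caches: list of (num_tokens, head_dim) arrays, one per head in the group.
    Returns the shared cache and the (group_width, out_dim) projection matrix.
    """
    stacked = np.concatenate(caches, axis=1)      # (tokens, g * head_dim)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    proj = vt[:out_dim].T                         # leading right-singular vectors
    return stacked @ proj, proj


rng = np.random.default_rng(0)
tokens, head_dim, rank = 512, 64, 16

# Synthetic stand-in: two heads whose caches share a rank-16 subspace plus noise.
basis = rng.standard_normal((tokens, rank))
heads = [basis @ rng.standard_normal((rank, head_dim))
         + 0.01 * rng.standard_normal((tokens, head_dim))
         for _ in range(2)]

# Energy retained when 25% and 50% of the dimensions are kept (cf. Figure 1).
print(energy_kept(heads[0], 0.25), energy_kept(heads[0], 0.50))

# Two heads compressed into one shared head of the original width (cf. Figure 2).
shared, proj = compress_group(heads, out_dim=head_dim)
stacked = np.concatenate(heads, axis=1)
rel_err = np.linalg.norm(shared @ proj.T - stacked) / np.linalg.norm(stacked)
print(shared.shape, rel_err)
```

On this synthetic data nearly all the energy sits in the first 16 directions, so halving the number of heads loses very little. How low-rank real KV caches are is an empirical matter, which is what Figure 1 reports.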