Lossless KV Cache Compression to 2%

Zhen Yang, J. N. Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang

TL;DR

This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels.

Abstract

Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.
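To make the "less than 2%" figure concrete, the sketch below shows how the three compression axes named in the abstract (dimension reduction, layer sharing, quantization) multiply together. Every number in it is a hypothetical assumption chosen for illustration, not the configuration reported in the paper, and `kv_cache_bytes` is an illustrative helper rather than anything from the authors' code.

```python
def kv_cache_bytes(caching_layers, cached_dims_per_token, seq_len, bits_per_value):
    """Bytes needed to cache `seq_len` tokens across the layers that store a cache."""
    return caching_layers * cached_dims_per_token * seq_len * bits_per_value / 8


SEQ_LEN = 4096  # hypothetical context length

# Hypothetical baseline: 32 layers, 32 heads x 128 dims, keys and values, 16-bit.
baseline = kv_cache_bytes(32, 2 * 32 * 128, SEQ_LEN, 16)

# Hypothetical compressed setup: a 512-dim latent per token (dimension reduction),
# cached by only every second layer (layer sharing), stored in 4 bits (quantization).
compressed = kv_cache_bytes(16, 512, SEQ_LEN, 4)

ratio = compressed / baseline
print(f"compressed cache is {ratio:.2%} of the baseline")
```

Under these assumed values the three factors are 1/16 (dimensions), 1/2 (layers), and 1/4 (bits), which multiply to 1/128 of the baseline. The point is only that modest reductions on each axis compound quickly; the paper's actual ratios may be split differently.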

Paper Structure

This paper contains 20 sections, 11 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Overview of the Cross-Layer Latent Attention (CLLA) architecture. During training with CLLA, the preceding layer computes the compressed key-value (KV) pairs after quantization and dequantization. The subsequent layer then continues to utilize the compressed KV from the previous layer. During inference, the model stores the compressed KV pairs of the l-th layer with int4 quantization.
  • Figure 2: Three different approaches for passing latent key-value (KV) pairs and K-rope in the attention mechanism to the next layer within the CLLA architecture. The final CLLA selects the left version (CLLA+share-latent).
  • Figure 3: Left: the model with KV projection weights sharing. Right: the model without KV projection weights sharing.
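The Figure 1 caption mentions a quantize-then-dequantize step during training and int4 storage at inference. As a minimal sketch of what such a round trip looks like, the code below uses a generic symmetric per-vector scheme. This is an assumption for illustration: the section does not state the paper's scale granularity or rounding scheme, and the function names are hypothetical.

```python
def quantize_int4(values):
    """Symmetric per-vector quantization to integer codes in [-8, 7].

    Assumed scheme: one float scale per vector, chosen so the largest
    magnitude maps to code 7 (or -7).
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical latent KV vector for a single token.
latent = [0.7, -0.35, 0.1, 0.0, 0.42, -0.66]

codes, scale = quantize_int4(latent)
restored = dequantize_int4(codes, scale)
max_error = max(abs(a - b) for a, b in zip(latent, restored))
print(codes, f"max round-trip error = {max_error:.4f}")
```

With this scheme each value needs only 4 bits plus a shared scale, and the round-trip error per element is bounded by half the scale. Running the same round trip in the training forward pass, as the caption describes, exposes the model to that error before it is ever served from an int4 cache.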