SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Zihao Wang, Bin Cui, Shaoduo Gan

TL;DR

The paper tackles the KV-cache memory bottleneck in decoder-only LLM inference by proposing SqueezeAttention, a 2D KV-cache budgeting approach that allocates per-layer budgets based on layer importance inferred from cosine similarity during prefilling. It clusters layers into groups and redistributes the shared sequence-wise cache budget across these groups, enabling more critical layers to cache more tokens while less important ones cache fewer. Across seven models (7B–70B) and five tasks, SqueezeAttention achieves 30–70% memory savings and up to 2.2× throughput, while maintaining or improving accuracy, and it remains compatible with common sequence-wise eviction strategies like H2O, Sliding Window, and StreamingLLM. The method introduces modest one-time prefilling overhead and offers a practical, generalizable enhancement to KV-cache-based LLM inference that reduces energy usage and latency.

Abstract

Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is considered critical to reducing the cost of inference. Most existing KV-cache compression algorithms sparsify the sequence of tokens by exploiting the differing importance of tokens. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we find that by identifying the importance of attention layers, we can optimize the KV-cache jointly along two dimensions: sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention, which optimizes the allocation of the KV-cache budget among layers on the fly and then incorporates three representative sequence-wise algorithms to compress the KV-cache of each layer within its own budget. Specifically, we first measure each layer's importance by calculating the cosine similarity of the input prompt's vectors before and after each self-attention layer. Based on this similarity, we then categorize the layers into two groups and adjust their KV budgets accordingly. By optimizing the KV-cache along both the sequence and layer dimensions, SqueezeAttention achieves memory reductions of around 30% to 70% and throughput improvements of up to 2.2× across a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.
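
The layer-wise step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). The function names, the 1-D two-means grouping rule, the scaling factor `p`, and the redistribution formula that keeps the total budget roughly constant are all assumptions made here for illustration.

```python
import numpy as np

def layer_cosine_similarity(before, after):
    """Mean per-token cosine similarity between the hidden states entering
    and leaving one self-attention layer. Shapes: (num_tokens, hidden_dim)."""
    dot = np.sum(before * after, axis=-1)
    norms = np.linalg.norm(before, axis=-1) * np.linalg.norm(after, axis=-1)
    return float(np.mean(dot / np.maximum(norms, 1e-12)))

def split_two_groups(similarities, iters=20):
    """Assumed grouping rule: 1-D two-means clustering.
    Returns a boolean mask that is True for the high-similarity group,
    i.e. layers that change their input least (treated as less important)."""
    s = np.asarray(similarities, dtype=float)
    lo, hi = s.min(), s.max()
    high = np.zeros(len(s), dtype=bool)
    for _ in range(iters):
        high = np.abs(s - hi) < np.abs(s - lo)
        if high.all() or not high.any():
            break
        lo, hi = s[~high].mean(), s[high].mean()
    return high

def allocate_budgets(similarities, init_budget, p=0.3):
    """Assumed reallocation: less important layers keep a fraction p of the
    uniform budget; the tokens freed are shared among the remaining layers."""
    less_important = split_two_groups(similarities)
    n = len(similarities)
    n_low = int(less_important.sum())
    budgets = np.full(n, init_budget, dtype=int)
    if 0 < n_low < n:
        low = int(round(init_budget * p))
        high = (n * init_budget - n_low * low) // (n - n_low)
        budgets[less_important] = low
        budgets[~less_important] = high
    return budgets

# Hypothetical per-layer similarities measured once during prefilling.
sims = [0.2, 0.3, 0.25, 0.9, 0.95, 0.92]
print(allocate_budgets(sims, init_budget=1000, p=0.3).tolist())
# → [1700, 1700, 1700, 300, 300, 300]
```

In this sketch the total number of cached tokens across layers stays at or below the original uniform total, so any memory saving would come from lowering `init_budget` while the more important layers still retain enough tokens.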

Paper Structure (23 sections, 4 equations, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: Demonstrations of KV-cache policies in inference from the view of the sequence and attention layer. Full Cache (leftmost column) simply stores the KV embeddings for all the tokens in all the layers. Sequence-wise compression algorithms (middle column) drop tokens in the sequence's dimension, where each layer has the same cache budget. SqueezeAttention (rightmost column) further compresses the KV-cache by adaptively re-allocating the cache budgets in the layer's dimension.
  • Figure 2: Visualization of the cosine similarity before and after the self-attention calculation of each attention layer. Layers with higher cosine similarity, shown in lighter colors, have a relatively lower impact on the input vectors.
  • Figure 3: Performance of SqueezeAttention, best baselines, and Full Cache under different cache budgets.
  • Figure 4: Comparison of per-token decoding memory usage among Full Cache, SqueezeAttention, and the best baselines when achieving the same accuracy, as shown in the accuracy results table.
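
The policies contrasted in Figure 1 can be made concrete with a toy sketch that combines per-layer budgets with Sliding Window eviction, one of the sequence-wise strategies named above. The class name `LayerwiseSlidingWindowCache` is hypothetical, and token positions stand in for the actual KV tensors. This is an illustration under those assumptions, not the paper's code.

```python
from collections import deque

class LayerwiseSlidingWindowCache:
    """Toy per-layer cache: each layer keeps only its most recent tokens,
    up to that layer's own budget."""

    def __init__(self, budgets):
        # deque(maxlen=b) silently drops the oldest entry once b is reached,
        # which is exactly sliding-window eviction.
        self.layers = [deque(maxlen=b) for b in budgets]

    def append(self, token_position):
        # In a real model each layer would store its own K and V tensors;
        # here the token position is a stand-in for both.
        for layer in self.layers:
            layer.append(token_position)

    def sizes(self):
        return [len(layer) for layer in self.layers]

# Uniform budget (sequence-wise only) versus layer-wise budgets.
uniform = LayerwiseSlidingWindowCache([4, 4, 4])
layerwise = LayerwiseSlidingWindowCache([6, 6, 2])
for pos in range(10):
    uniform.append(pos)
    layerwise.append(pos)

print(uniform.sizes())    # → [4, 4, 4]
print(layerwise.sizes())  # → [6, 6, 2]
```

A full cache would hold 10 entries in each of the three layers after these ten steps. The uniform policy holds 12 entries in total and the layer-wise policy 14, with the third layer retaining only the two most recent tokens.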