CORM: Cache Optimization with Recent Message for Large Language Model Inference

Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi

TL;DR

This work tackles the KV cache memory bottleneck in large language model inference by exploiting attention sparsity and the observation that similar queries attend to similar keys. The authors introduce CORM, a dynamic eviction policy that uses the attention information of recent queries to retain only the most informative key-value pairs, with no model fine-tuning. Empirical results on LLaMA2-7B-Chat and Vicuna-7b-v1.5-16k show up to 70% KV cache reduction with negligible performance degradation across LongBench tasks, and CORM can be combined with grouped-query attention (GQA) for further compression. The approach generalizes across multiple models (with caveats for non-RoPE encodings) and offers a practical, training-free path to memory-efficient LLM inference.
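
To put the memory bottleneck in concrete terms, the following is a back-of-the-envelope sketch of KV cache size. The defaults (32 layers, 32 KV heads, head dimension 128, 2 bytes per element for fp16, batch size 1) are assumptions chosen to resemble a LLaMA2-7B-style model; they are not taken from the text above, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Bytes held by the KV cache: one key and one value vector
    per token, per layer, per KV head (all defaults are assumptions)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


GIB = 1024 ** 3
full = kv_cache_bytes(4096)      # cache for a 4096-token sequence
reduced = full * 30 // 100       # keep 30% of entries ("up to 70%" reduction)

print(f"{full / GIB:.2f} GiB")     # 2.00 GiB
print(f"{reduced / GIB:.2f} GiB")  # 0.60 GiB
```

The cost is linear in `seq_len`, so doubling the context doubles the cache. Passing a smaller `n_kv_heads` models grouped-query attention, which shrinks the cache along a different axis than eviction does.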

Abstract

Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, require substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache grows linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce a method for optimizing the KV cache that considerably reduces its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA, yielding a further improvement in compression rate.
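
The idea of retaining key-value pairs based on the attention of recent queries can be illustrated with a minimal sketch. This is not the authors' implementation: the importance rule used here (a key counts as important to a query if its attention probability exceeds the uniform average `1 / number_of_visible_keys`) is an assumption, as is always keeping the `r` most recent keys. The names `corm_keep_mask` and `evict` are hypothetical.

```python
def corm_keep_mask(recent_attn, r):
    """Decide which cached keys to keep.

    recent_attn: attention-probability rows of the last w queries for one
                 head; each row covers the keys visible to that query.
    r:           number of most recent keys that are always kept.
    """
    n_keys = len(recent_attn[-1])
    # Always keep the r most recent keys.
    keep = [j >= n_keys - r for j in range(n_keys)]
    for row in recent_attn:
        threshold = 1.0 / len(row)  # assumed rule: above the uniform average
        for j, score in enumerate(row):
            if score > threshold:
                keep[j] = True      # important to at least one recent query
    return keep


def evict(keys, values, keep):
    """Drop every key-value pair whose mask entry is False."""
    kept = [j for j, flag in enumerate(keep) if flag]
    return [keys[j] for j in kept], [values[j] for j in kept]


# Toy example: the last w=2 queries, seeing 5 and 6 keys respectively.
recent_attn = [
    [0.40, 0.05, 0.40, 0.05, 0.10],
    [0.60, 0.02, 0.03, 0.05, 0.10, 0.20],
]
mask = corm_keep_mask(recent_attn, r=1)
keys, values = evict([f"k{j}" for j in range(6)],
                     [f"v{j}" for j in range(6)], mask)
print(mask)  # [True, False, True, False, False, True]
print(keys)  # ['k0', 'k2', 'k5']
```

Because a key survives if any recent query found it important, the number of retained entries varies with the input rather than being fixed in advance, which is one way to read "dynamically retains" above.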

Paper Structure (35 sections, 4 equations, 13 figures, 10 tables, 1 algorithm)

Figures (13)

  • Figure 1: Attention sparsity of LLaMA2-7B. (a) Layer-wise attention sparsity. (b) Head-wise attention sparsity of layer 0 and layer 1.
  • Figure 2: Similar queries have similar concerns for keys. We plot the attention maps from two different layers for one sentence. Attention scores are discretized, and important keys are shown in bright green. Each attention map has two red borders: the bottom border marks the important keys that the current query actually focuses on, while the other border marks the important keys that the most similar query focuses on.
  • Figure 3: Visualization of query vectors' cosine similarity over a randomly sampled sentence of length 1024 on LLaMA2-7B. The $i$-th row of the map represents the cosine similarity of the $i$-th query to all previous queries. The redder the color, the higher the similarity between two queries. The plot reveals that in most cases the current query is most similar to recent queries.
  • Figure 4: Relationship between compression rate and sequence length, averaged over 10 texts randomly sampled from PG19. The plots show that the compression rate of CORM "256+256" (w=256, r=256) closely matches a budget of 1024 for LLaMA2-7B-Chat and a budget of 2048 for Vicuna-7b-v1.5-16k.
  • Figure 5: Visualization of query vectors' cosine similarity over a randomly sampled sentence of length 1024 across Falcon-7B, Qwen1.5-7B, LLaMA3-8B, and OPT-6.7B.
  • ...and 8 more figures
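
The query-similarity maps described in Figures 3 and 5 can be reproduced in miniature. The sketch below computes, for each query, its cosine similarity to all previous queries and reports the most similar one. The query vectors are synthetic (a 2-D vector that rotates by a fixed angle per step), purely an assumption for illustration; real maps would use query vectors taken from a model's attention layers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_rows(queries):
    """Row i holds the cosine similarity of query i to queries 0..i-1."""
    return [[cosine(queries[i], queries[j]) for j in range(i)]
            for i in range(len(queries))]


def most_similar_previous(queries):
    """Index of the most similar earlier query, or None for the first one."""
    result = []
    for row in similarity_rows(queries):
        if row:
            result.append(max(range(len(row)), key=lambda j: row[j]))
        else:
            result.append(None)
    return result


# Synthetic, slowly drifting queries (illustrative assumption, not model data).
queries = [(math.cos(0.3 * i), math.sin(0.3 * i)) for i in range(6)]
print(most_similar_previous(queries))  # [None, 0, 1, 2, 3, 4]
```

In this toy setting every query is closest to its immediate predecessor, which mirrors the pattern the captions describe: the current query tends to be most similar to recent queries.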