CORM: Cache Optimization with Recent Message for Large Language Model Inference
Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi
TL;DR
This work tackles the KV cache memory bottleneck in large language model inference by exploiting attention sparsity and the observation that similar queries rely on similar keys. The authors introduce CORM, a dynamic eviction policy that uses the attention information of recent queries to retain only the most informative key-value pairs, without model fine-tuning. Empirical results on LLaMA2-7B-Chat and Vicuna-7b-v1.5-16k show up to 70% KV cache reduction with negligible performance degradation across LongBench tasks, and CORM can be combined with grouped-query attention (GQA) for further compression. The approach generalizes across multiple models (with caveats for non-RoPE encodings) and offers a practical, training-free path to memory-efficient LLM inference.
Abstract
Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, require substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache grows linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce a method for optimizing the KV cache that considerably reduces its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between the query vectors of adjacent tokens, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA, which yields a higher compression rate.
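
To make the idea of evicting by recent query attention concrete, the following is a minimal single-head sketch in NumPy. It is not the authors' implementation: the function name `recent_query_keep_mask`, the `window` parameter, and the rule "keep a key if any recent query gives it at least the uniform weight 1/t" are all assumptions chosen for illustration. The abstract only states that the attention information of a small fraction of preceding queries is sufficient.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def recent_query_keep_mask(queries, keys, window=4, threshold=None):
    """Return a boolean mask over cached keys to retain (hypothetical helper).

    queries:   (t, d) query vectors of the t tokens seen so far, one head.
    keys:      (t, d) cached key vectors of the same tokens.
    window:    how many of the most recent queries are consulted (assumed).
    threshold: attention weight at or above which a key counts as "used";
               defaults to the uniform weight 1/t (an assumption).
    """
    assert window >= 1
    t, d = keys.shape
    if threshold is None:
        threshold = 1.0 / t

    recent = queries[-window:]                 # (w, d), w <= window
    w = recent.shape[0]
    scores = recent @ keys.T / np.sqrt(d)      # (w, t) scaled dot products

    # Causal mask: the query at position p only sees keys at positions <= p.
    q_pos = np.arange(t - w, t)[:, None]
    k_pos = np.arange(t)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)

    attn = softmax(scores, axis=-1)
    # Keep a key if at least one recent query attended to it strongly enough.
    keep = (attn >= threshold).any(axis=0)
    keep[-w:] = True                           # never evict the recent window
    return keep


# Usage: prune a toy cache of 16 tokens with head dimension 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
mask = recent_query_keep_mask(q, k, window=4)
pruned_keys = k[mask]
```

Because the mask depends on the current attention pattern rather than on a fixed budget, the number of retained keys varies from step to step in this sketch; the corresponding value vectors would be pruned with the same mask.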
