KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
TL;DR
This work addresses the challenge of maintaining long-context capabilities in LLMs by benchmarking a broad taxonomy of efficiency methods for KV cache management, including quantization, token dropping, and prompt compression. It conducts a large-scale, reproducible evaluation across 65 settings and 7 task categories on a mix of transformer-based LLMs and linear-time architectures, revealing that KV cache quantization generally offers robust, cross-task performance while token dropping excels on coding tasks, and that hybrid architectures with some attention components improve long-context retrieval. The authors provide a minimalistic, extensible benchmark platform and actionable insights for future development, highlighting the importance of preserving prefill fidelity, the trade-offs between compression and memory, and the gap in needle-in-a-haystack robustness. The results have practical implications for deploying long-context LLMs in real systems, guiding method selection based on task type and hardware constraints while underscoring areas where further improvements are needed.
Abstract
Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs. In response, multiple schools of efficiency-driven approaches (such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures) have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights, as well as a friendly workbench, for the future development of long context-capable LLMs. The source code is available at https://github.com/henryzhongsc/longctx_bench.
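
To make the "growing size of the KV cache" concrete, the following is a minimal back-of-the-envelope sketch, not code from the benchmark itself. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions for a 7B-class transformer, and the 4-bit estimate ignores the scale and zero-point overhead that real quantization schemes carry.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2.0, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both a key tensor and a
    value tensor in every layer. bytes_per_elem=2.0 corresponds to FP16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 7B-class configuration (an assumption for illustration only).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

GIB = 1024 ** 3
for seq_len in (4096, 32768, 131072):
    fp16 = kv_cache_bytes(seq_len, **cfg)
    # 4 bits per element = 0.5 bytes; quantization metadata is ignored here.
    int4 = kv_cache_bytes(seq_len, bytes_per_elem=0.5, **cfg)
    print(f"{seq_len:>7} tokens: FP16 {fp16 / GIB:6.1f} GiB, "
          f"4-bit {int4 / GIB:6.1f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, which is why methods that shrink the per-token footprint (quantization) or the number of cached tokens (token dropping, prompt compression) are natural candidates for long-context efficiency.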
