KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu

TL;DR

This work addresses the challenge of maintaining long-context capabilities in LLMs by benchmarking a broad taxonomy of efficiency methods for KV cache management, including quantization, token dropping, and prompt compression. It conducts a large-scale, reproducible evaluation across 65 settings and 7 task categories on a mix of transformer-based LLMs and linear-time architectures, revealing that KV cache quantization generally offers robust, cross-task performance while token dropping excels on coding tasks, and that hybrid architectures with some attention components improve long-context retrieval. The authors provide a minimalistic, extensible benchmark platform and actionable insights for future development, highlighting the importance of preserving prefill fidelity, the trade-offs between compression and memory, and the gap in needle-in-a-haystack robustness. The results have practical implications for deploying long-context LLMs in real systems, guiding method selection based on task type and hardware constraints while underscoring areas where further improvements are needed.

Abstract

Long context capability is a crucial competency for large language models (LLMs), as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many other tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs. Multiple schools of efficiency-driven approaches - such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures - have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights - as well as a friendly workbench - for the future development of long context-capable LLMs. The source code is available at https://github.com/henryzhongsc/longctx_bench.
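
As a rough illustration of why the KV cache becomes a bottleneck at long context, the sketch below estimates its size as a function of sequence length. It is not taken from the paper or its codebase: the function name `kv_cache_bytes` is hypothetical, the model shape (32 layers, 8 key/value heads, head dimension 128) is an assumed Llama-3-8B-style configuration, and the 4-bit figure ignores the scale and zero-point metadata that real quantization schemes also store.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bits_per_value=16, batch_size=1):
    """Approximate KV cache size in bytes (hypothetical helper).

    Ignores quantization metadata such as scales and zero points.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    n_values = 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim
    return n_values * bits_per_value // 8


# Assumed Llama-3-8B-style shape (grouped-query attention with 8 KV heads).
LLAMA3_8B_SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for seq_len in (8_192, 32_768, 131_072):
        fp16 = kv_cache_bytes(seq_len, **LLAMA3_8B_SHAPE)
        int4 = kv_cache_bytes(seq_len, bits_per_value=4, **LLAMA3_8B_SHAPE)
        print(f"{seq_len:>7} tokens: "
              f"fp16 {fp16 / 2**30:.2f} GiB, 4-bit {int4 / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, and lowering the bits per stored value shrinks it proportionally. Token dropping and prompt compression instead reduce the effective `seq_len` term.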

Paper Structure (44 sections, 1 equation, 24 figures, 9 tables)

Figures (24)

  • Figure 1: The radar plot of different methods: (a) Llama-3-8B w/ quantization; (b) Llama-3-8B w/ token dropping; (c) linear-time sequence models and mixed architectures; (d) Llama-3-8B w/ prompt compression.
  • Figure 2: $\mathrm{H_2O}$ with different compression ratios on three commonly used LLMs.
  • Figure 3: Needle-in-a-Haystack results on Llama-3-8B-Instruct, linear-time sequence models, and mixed architectures. The best method in each school of approaches is shown at comparable compression ratios. The same input length may convert to a different number of tokens under different models, as noted in the upper right corners.
  • Figure 4: Performance of KV cache quantization, token dropping, prompt compression, and other architectures on LongBench.
  • Figure 5: Baseline performance on the needle test for three commonly used LLMs.
  • ...and 19 more figures