Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen

TL;DR

LESS tackles the KV cache memory bottleneck in LLM inference by fusing eviction-based sparse KV caching with a constant-sized, learnable low-rank state. Through per-layer training of lightweight kernels $\phi$ and $\psi$, LESS synthesizes the discarded information into a persistent state $(\mathbf{H}_t, \mathbf{z}_t)$, enabling near-full-cache attention with significantly reduced memory and computation. Empirical results across language modeling, classification, and summarization on Llama 2 and Falcon show substantial performance recovery relative to sparse baselines, in many cases matching full caching, while delivering latency reductions and higher throughput. The approach requires minimal architectural changes, trains scalably one layer at a time, and shows strong potential for efficient, long-context LLM deployment.

Abstract

Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient. Relevant code can be found at https://github.com/hdong920/LESS.
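To make the memory bottleneck concrete, the arithmetic below estimates the size of a full KV cache against an eviction-based one. It is a rough sketch, not a figure from the paper: the model shape (32 layers, 32 heads, head dimension 128, fp16) is assumed from the published Llama 2 7B configuration, the 512-token budget is a hypothetical choice, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer, head, and cached token."""
    # The leading 2 counts one key vector plus one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Assumed Llama 2 7B shape: 32 layers, 32 heads, head dimension 128, fp16 values.
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
# Hypothetical eviction budget: keep only 512 KV pairs per head.
sparse = kv_cache_bytes(32, 32, 128, seq_len=512)

print(full / 2**30, "GiB for the full cache")
print(sparse / 2**20, "MiB for the sparse cache")
```

Under these assumptions the full cache grows linearly with sequence length and batch size, while the sparse cache stays fixed once the budget is reached; the constant-sized cache that LESS adds does not change that scaling.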

Paper Structure (26 sections, 7 equations, 13 figures, 8 tables, 1 algorithm)

This paper contains 26 sections, 7 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Toy (top row) and Llama 2 7B (bottom row) example decoder attention maps with $\text{H}_2\text{O}$ as the underlying sparse policy. In the top row, red/pink and grey squares are positive and zero attention probabilities, respectively. In the bottom row, darker colors indicate larger attention probabilities. Sparse attention policies zero out many positive attention probabilities. Our method, LESS, ensures all previous tokens will have some contribution to the attention layer output to better retain information.
  • Figure 2: Incorrect summary by Falcon 7B with sparse policy $\text{H}_2\text{O}$.
  • Figure 3: Attention residuals exploration in Llama 2 7B on WikiText (Merity et al., 2016). Mean and 1000 sample relative singular value plots of true attention outputs and residuals from top-$512$ sparse policy, showing the residual is much lower rank (left). End-to-end performance (lower is better) using top-$k$ caching with and without low-rank approximations (right). A rank-4 approximation virtually recovers the original performance.
  • Figure 4: LESS algorithm during inference. At each decoding step, attention is calculated as in the attention approximation equation (eq:attn_approx). To prepare for the next decoding step, the cache is updated by placing the most recent KV pair into the sparse policy cache, and if it has exceeded capacity, a KV pair will be evicted and integrated into the low-rank cache $\mathbf{H}_t$ before being deleted.
  • Figure 5: Experimental setup. First, a sparse policy is chosen as the underlying policy behind all methods. Then, we compare performance among the full cache model, Baseline, Baseline+, and LESS. Baseline+ and LESS use the same amount of storage which is slightly larger than the requirements of Baseline.
  • ...and 8 more figures
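The cache update described in the Figure 4 caption, together with the state $(\mathbf{H}_t, \mathbf{z}_t)$ from the TL;DR, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature maps `phi` and `psi` are fixed exponentials standing in for the learned kernels, eviction drops the oldest pair rather than following $\text{H}_2\text{O}$, and the low-rank state is assumed to enter attention as extra numerator and denominator terms, since the paper's exact equation is not reproduced in this section. The class name `LessCacheSketch` is hypothetical.

```python
import math


def phi(q):
    # Hypothetical positive feature map for queries; LESS trains its own kernel per layer.
    return [math.exp(x) for x in q]


def psi(k):
    # Hypothetical positive feature map for keys (same caveat as phi).
    return [math.exp(x) for x in k]


class LessCacheSketch:
    def __init__(self, capacity, d_k, d_v):
        self.capacity = capacity
        self.keys, self.vals = [], []               # sparse cache of exact KV pairs
        self.H = [[0.0] * d_v for _ in range(d_k)]  # constant-sized numerator state
        self.z = [0.0] * d_k                        # constant-sized normalizer state

    def append(self, k, v):
        self.keys.append(k)
        self.vals.append(v)
        if len(self.keys) > self.capacity:
            # Stand-in policy: evict the oldest pair, then fold it into (H, z).
            k_old, v_old = self.keys.pop(0), self.vals.pop(0)
            for i, f in enumerate(psi(k_old)):
                self.z[i] += f
                for j, x in enumerate(v_old):
                    self.H[i][j] += f * x

    def attend(self, q):
        # Exact (unnormalized) softmax weights for the pairs still in the sparse cache.
        scale = math.sqrt(len(q))
        w = [math.exp(sum(a * b for a, b in zip(q, k)) / scale) for k in self.keys]
        f = phi(q)
        d_v = len(self.H[0])
        # Evicted pairs contribute through phi(q) H and phi(q) z.
        num = [sum(wi * v[j] for wi, v in zip(w, self.vals))
               + sum(fi * row[j] for fi, row in zip(f, self.H))
               for j in range(d_v)]
        den = sum(w) + sum(fi * zi for fi, zi in zip(f, self.z))
        return [n / den for n in num]


def full_attention(q, keys, vals):
    # Reference softmax attention over every KV pair.
    scale = math.sqrt(len(q))
    w = [math.exp(sum(a * b for a, b in zip(q, k)) / scale) for k in keys]
    s = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, vals)) / s for j in range(len(vals[0]))]


keys = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]]
vals = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
q = [0.2, -0.3]

small = LessCacheSketch(capacity=2, d_k=2, d_v=2)
for k, v in zip(keys, vals):
    small.append(k, v)
print(small.attend(q), full_attention(q, keys, vals))
```

With these untrained feature maps the two printed vectors differ; the point of the sketch is only that every evicted token still carries positive weight in the output while the stored state stays the same size no matter how many tokens are evicted.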