Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices

Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu

TL;DR

Locret addresses the challenge of long-context inference on consumer-grade devices by introducing a lightweight, training-based KV cache eviction policy. Trainable retaining heads predict a causal importance score (CIS) for each KV cache unit, and the units with the lowest scores are evicted. The framework integrates with a chunked prefill pipeline and uses stabilizers to preserve local continuity, so the cache stays within a fixed memory budget while the input streams in. A query-aware variant, Locret-Q, extends the approach to query-driven tasks, achieving faster prefill and robust performance on benchmarks like RULER. Across Phi-3-mini-128K and Llama-3.1-8B-instruct, Locret demonstrates substantial KV cache compression (up to $20\times$) with minimal quality loss, enabling 128K+ context on a single NVIDIA 4090 GPU and showing strong practicality for democratizing long-context LLM use on consumer hardware.
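The eviction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-token scores are passed in as a plain list standing in for the retaining heads' CIS predictions, cache units hold only a position and a score instead of key/value tensors, and the names `evict_to_budget` and `chunked_prefill` are hypothetical. The sketch also assumes that stabilizers are the most recent `n_stabilizers` units of a chunk, exempted from eviction until the final chunk.

```python
from typing import List, Tuple

# A cache unit is (position, score). A real cache would also hold key/value tensors.
CacheUnit = Tuple[int, float]


def evict_to_budget(
    cache: List[CacheUnit], budget: int, n_stabilizers: int, is_last_chunk: bool
) -> List[CacheUnit]:
    """Shrink the cache to at most `budget` units.

    Assumption: the newest `n_stabilizers` units are kept unconditionally,
    except after the last chunk, where every unit competes on score alone.
    """
    if len(cache) <= budget:
        return cache
    n_recent = 0 if is_last_chunk else min(n_stabilizers, budget)
    split = len(cache) - n_recent
    candidates, stabilizers = cache[:split], cache[split:]
    # Keep the highest-scoring candidates, then restore positional order.
    top = sorted(candidates, key=lambda unit: unit[1], reverse=True)[: budget - n_recent]
    top.sort(key=lambda unit: unit[0])
    return top + stabilizers


def chunked_prefill(
    scores: List[float], chunk_size: int, budget: int, n_stabilizers: int
) -> List[int]:
    """Process the input chunk by chunk and return the positions still cached."""
    cache: List[CacheUnit] = []
    n = len(scores)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Append the new chunk's units, then evict back down to the budget.
        cache.extend((pos, scores[pos]) for pos in range(start, end))
        cache = evict_to_budget(cache, budget, n_stabilizers, is_last_chunk=(end == n))
    return [pos for pos, _ in cache]


toy_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
kept = chunked_prefill(toy_scores, chunk_size=4, budget=3, n_stabilizers=1)
print(kept)  # [1, 3, 5]
```

In this sketch the cache never holds more than `budget + chunk_size` units at once, which is the property that lets a streaming input fit in fixed memory.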

Abstract

Scaling the input context length of a large language model (LLM) significantly increases the computation cost and the memory footprint needed to maintain the attention key-value (KV) cache. Existing KV cache compression methods suffer from inefficient compression strategies and limited memory reduction, making it difficult for LLMs to conduct long-context inference on consumer-grade devices, especially when processing long-context streaming input. Such obstacles prevent consumer-grade devices from supporting more complex applications, creating challenges for the democratization of LLMs. To overcome this, we propose Locret, the first framework to create an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units with learnable retaining heads, Locret enables precise eviction of cache units, facilitating efficient long-context inference. In our extensive empirical studies, Locret outperforms recent popular and competitive approaches in terms of memory efficiency and generation quality: it achieves a KV cache compression ratio of up to $20\times$ with less than 10% performance loss. Furthermore, Locret achieves 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality and costs less than 1 GPU hour of additional training.
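To make the memory pressure concrete, the following back-of-the-envelope calculation estimates the size of an uncompressed KV cache. It is an illustrative sketch only: the helper `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is assumed from the publicly released Llama-3.1-8B configuration rather than taken from this paper. The $20\times$ figure is applied as a plain division and may not match how the paper measures compression.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 storage
) -> int:
    """Bytes needed to store keys and values for `n_tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


GIB = 2**30
context_len = 128 * 1024  # 128K tokens

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
full_cache = kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128)
compressed_cache = full_cache / 20  # illustrative 20x compression

print(f"full KV cache:       {full_cache / GIB:.1f} GiB")        # 16.0 GiB
print(f"after 20x reduction: {compressed_cache / GIB:.1f} GiB")  # 0.8 GiB
```

Under these assumptions the full cache alone takes 16 GiB at 128K tokens, before counting model weights, while an NVIDIA 4090 has 24 GB of memory in total.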

Paper Structure

This paper contains 38 sections, 1 theorem, 4 equations, 10 figures, 20 tables, 1 algorithm.

Key Result

Theorem 14.4

(Calculating cache units with Top-$b$ CIS is a cache problem.) Given a causal calculation $f = g\circ Sel$, and its generated sequence $\{c_i\}$, a CIS $s_i = h(c_i)$ and a positive number $b\in \mathbb{Z}_+$, if the selection function $Sel$ satisfies the following condition, then $(f, b, \{c_i\})$ is a cache problem with budget $b$.

Figures (10)

  • Figure 1: For each prefix length of the context, this figure shows how consistently the importance of prefix tokens is evaluated when using the full context versus using only the prefix, without subsequent tokens. Consistency is defined as the size of the intersection of the top-10% tokens under the two evaluation methods, divided by the number of top-10% tokens in the prefix. More details are in the Appendix.
  • Figure 2: The framework of Locret. "$\mathbf{R}$" represents the retaining head. $\text{P}_i$ and $\text{A}_i$ correspond to the $i$-th prompt token and answer token. "t" represents the time step in chunked prefill, "$b$" represents the budget size, and "$n_{s}$" represents the length of the stabilizers. For simplicity, our notation here does not reflect the concept of layers.
  • Figure 3: R.Number with different stabilizer lengths $n_s$. (a) Task accuracy under different $n_s$. (b) Maximum absolute error of the last hidden state. (c) Mean absolute error of the predicted CIS. We conduct this experiment on entries 101-120 of R.Number using the Phi-3-mini-128K backbone.
  • Figure 4: Memory Statistics vs. Task Performance. The red lines correspond to the theoretical size of the model weights, while the blue lines represent the total size of the model weights and the full KV cache without any compression. The purple lines indicate the accuracies of FullAttn. "Total Memory" represents the total memory usage of both GPU and CPU.
  • Figure 5: Scores of Locret under (a) various budgets; (b) various $n_s$; (c) various chunk sizes.
  • ...and 5 more figures
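
The consistency metric defined in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming that each evaluation method yields one importance score per token; the function names and the toy scores are hypothetical, and how the paper actually obtains the importance scores is not specified here.

```python
from typing import List, Set


def top_percent_indices(scores: List[float], top_percent: int = 10) -> Set[int]:
    """Indices of the highest-scoring `top_percent` percent of tokens (at least one)."""
    k = max(1, len(scores) * top_percent // 100)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def prefix_consistency(
    full_context_scores: List[float], prefix_scores: List[float], top_percent: int = 10
) -> float:
    """Overlap of the top tokens in the prefix under the two evaluations.

    `full_context_scores` are importance scores computed with the full context
    visible; `prefix_scores` are computed from the prefix alone. Only the
    first len(prefix_scores) entries of the full-context scores are compared.
    """
    n = len(prefix_scores)
    top_full = top_percent_indices(full_context_scores[:n], top_percent)
    top_prefix = top_percent_indices(prefix_scores, top_percent)
    return len(top_full & top_prefix) / len(top_prefix)


# Toy data: 20 prefix tokens, so the top 10% is 2 tokens.
prefix_only = [i / 100 for i in range(20)]  # top-2 tokens: 19 and 18
with_full_context = list(prefix_only)
with_full_context[18] = -1.0                # token 18 loses importance; top-2: 19 and 17

print(prefix_consistency(with_full_context, prefix_only))  # 0.5
```

A value of 1.0 means both evaluations agree on which prefix tokens matter most; lower values mean that an importance estimate made without subsequent tokens diverges from the full-context one.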

Theorems & Definitions (4)

  • Definition 14.1
  • Definition 14.2
  • Definition 14.3
  • Theorem 14.4