Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

Yen-Chieh Huang, Pi-Cheng Hsiu, Rui Fang, Ming-Syan Chen

Abstract

Long-context LLM inference is bottlenecked by quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict a token's utility before it enters the cache. By filtering out low-utility states early, it maintains a compact global cache alongside a sliding local cache. Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama models with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV.
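
As a rough illustration of the admission idea described in the abstract, the following toy sketch keeps every token in a sliding local window and, once a token leaves that window, writes it to the global cache only if its gate score clears a threshold. This is not the authors' implementation: the class name `WriteGatedCache`, the `window` and `threshold` parameters, the promote-on-exit policy, and the hand-picked gate scores are all assumptions made for illustration. In the paper the utility score is predicted by a learned gate, whereas here it is simply passed in.

```python
from collections import deque


class WriteGatedCache:
    """Toy cache: sliding local window plus an admission-filtered global cache.

    Hypothetical sketch; names and policy are assumptions, not the paper's code.
    """

    def __init__(self, window: int, threshold: float):
        self.window = window        # assumed size of the sliding local cache
        self.threshold = threshold  # assumed admission threshold on the gate score
        self.local = deque()        # (position, kv, gate_score) for recent tokens
        self.global_cache = []      # (position, kv) for admitted tokens only

    def write(self, position: int, kv, gate_score: float) -> None:
        # Every new token is visible locally, regardless of its score.
        self.local.append((position, kv, gate_score))
        if len(self.local) > self.window:
            old_pos, old_kv, old_score = self.local.popleft()
            # Admission decision: only high-utility states become persistent.
            if old_score >= self.threshold:
                self.global_cache.append((old_pos, old_kv))

    def visible_positions(self) -> list:
        # Positions a new query could attend to: admitted tokens, then the window.
        return [p for p, _ in self.global_cache] + [p for p, _, _ in self.local]


cache = WriteGatedCache(window=2, threshold=0.5)
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]  # made-up gate scores, one per token
for pos, score in enumerate(scores):
    cache.write(pos, kv=f"kv{pos}", gate_score=score)

positions = cache.visible_positions()
print(positions)  # [0, 3, 4, 5]
```

In this run, tokens 1 and 2 score below the threshold and are dropped once they leave the window, so four of the six states remain visible. Low-scoring tokens are still attended to while they are recent, which matches the transient-utility observation in the caption of Figure 2.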

Paper Structure

This paper contains 33 sections, 1 theorem, 21 equations, 9 figures, 1 table.

Key Result

Proposition 1.1

For any $\lambda > 0$, let $\theta^*(\lambda)$ be a global minimizer of the unconstrained problem defined in Eq. (eq:unconstrained). Then, $\theta^*(\lambda)$ is also a global minimizer of the constrained problem defined in Eq. (eq:constrained) where the budget is set to $B = \mathcal{M}(\theta^*(\l

Figures (9)

  • Figure 1: Attention Bottleneck in Long-Context Inference. Measured on Llama 3.1 8B (Batch Size=1) on an H200 GPU. As sequence length increases, the computational cost of attention dominates (a) prefill latency, while the overhead of the KV cache dominates both (b) decode latency and (c) memory usage.
  • Figure 2: (a) Synergy of KV Admission and Selection. (b) Transient Utility of Tokens. Visualization of attention scores received by a representative "low-utility" token over time. While the token is ignored by distant future queries, it remains highly relevant within a short local window.
  • Figure 3: Heterogeneity of Token Utility. Attention patterns show that utility is both sparse and head-specific.
  • Figure 4: Memory Fragmentation with Ragged KV States.
  • Figure 5: Overview of Write-Gated KV.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1.1
  • proof