WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song

TL;DR

WindowKV tackles the memory bottleneck of KV caches in long-context LLM inference with a task-adaptive, window-level KV cache selection method coupled with intra-group KV cache index sharing. The approach uses an observation window and task-driven review windows: it scores and selects high-signal windows according to task type (localization vs. aggregation) and distributes budgets across layer groups via a pyramid-like allocation. Empirical results on LongBench and Needle-in-a-Haystack show WindowKV achieving state-of-the-art or competitive performance with only about 12% of the original KV cache, along with notable throughput gains and minimal classifier overhead. This enables efficient, task-aware long-context inference with reduced memory, facilitating industrial deployment of LLMs under tight resource constraints.
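To make the pyramid-like allocation concrete, here is a minimal sketch of splitting a total cache budget across layer groups. It is an illustration only, not the paper's formula: the function name `pyramid_group_budgets`, the linear decay, and the `ratio` parameter are all hypothetical, and it assumes (as in PyramidKV-style schemes) that earlier groups receive the larger share.

```python
def pyramid_group_budgets(total_budget: int, num_groups: int, ratio: float = 0.5) -> list[int]:
    """Split total_budget over layer groups with linearly decreasing shares.

    ratio is a hypothetical knob: the last group's weight relative to the first.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if num_groups == 1:
        return [total_budget]
    # Linearly interpolate weights from 1.0 (first group) down to ratio (last group).
    step = (1.0 - ratio) / (num_groups - 1)
    weights = [1.0 - step * g for g in range(num_groups)]
    scale = total_budget / sum(weights)
    # Round every share down, then hand the leftover tokens to the earliest groups.
    budgets = [int(w * scale) for w in weights]
    for g in range(total_budget - sum(budgets)):
        budgets[g % num_groups] += 1
    return budgets


budgets = pyramid_group_budgets(total_budget=4096, num_groups=8, ratio=0.25)
print(budgets, sum(budgets))
```

The budgets always sum to `total_budget`, so the overall memory footprint is fixed while the per-group share shrinks with depth.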

Abstract

With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become a foundational component. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens according to task-specific characteristics, ensuring that the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache index-sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
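The two ideas in the abstract, selecting windows of consecutive tokens and sharing the selected indices within a layer group, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the authors' implementation: `token_scores` stands in for whatever per-token importance signal is available (for example, attention received from the observation window), a window is scored by the plain sum of its token scores instead of the paper's task-dependent scoring, and the first layer of each group is assumed to compute the indices that the rest of the group reuses. All function and parameter names are hypothetical.

```python
def select_window_indices(token_scores, window_size, budget, observation_size):
    """Keep the trailing observation tokens plus the best-scoring windows before them."""
    n = len(token_scores)
    obs_start = max(0, n - observation_size)
    observation = list(range(obs_start, n))
    # Cut the earlier context into windows of consecutive tokens and score each one.
    windows = []
    for start in range(0, obs_start, window_size):
        end = min(start + window_size, obs_start)
        windows.append((sum(token_scores[start:end]), start, end))
    # Spend what is left of the budget on whole windows, best first.
    num_keep = max(0, budget - len(observation)) // window_size
    best = sorted(windows, key=lambda w: w[0], reverse=True)[:num_keep]
    selected = sorted(i for _, start, end in best for i in range(start, end))
    return selected + observation


def share_indices_within_groups(scores_by_layer, group_size, **select_kwargs):
    """Run selection once per group (assumed: at its first layer) and reuse the result."""
    indices_by_layer = []
    shared = None
    for layer, scores in enumerate(scores_by_layer):
        if layer % group_size == 0:
            shared = select_window_indices(scores, **select_kwargs)
        indices_by_layer.append(shared)
    return indices_by_layer


scores = [0.1] * 4 + [0.9] * 4 + [0.2] * 4 + [0.5] * 4
print(select_window_indices(scores, window_size=4, budget=8, observation_size=4))
```

With 16 tokens, a budget of 8, and an observation window of 4, the call keeps tokens 4 to 7 (the highest-scoring window) together with the last four tokens. Because whole windows are kept, the retained context stays contiguous, and because selection runs once per group, the other layers in the group skip the scoring step entirely.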

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of WindowKV with state-of-the-art KV cache compression methods. (a) Full KV retains all tokens in the KV cache for each layer, with cache size growing linearly with input length. (b) H2O maintains a fixed cache size across layers, selecting tokens based on attention scores. (c) PyramidKV adopts a pyramid-shaped cache structure, allocating varying cache budgets to different layers. These methods apply token-level selection strategies uniformly across all tasks. (d) WindowKV, in contrast, introduces a task-adaptive window selection method combined with an intra-group layer KV cache index-sharing strategy, dynamically allocating budgets across different groups.
  • Figure 2: Similarity of Intra-Group Layer KV Cache Indices.
  • Figure 3: Needle-in-a-Haystack results for LLaMA3-8B-Instruct with a KV cache size of 512 at 8K context length.
  • Figure 4: Impact of task-adaptive window selection and review window size on WindowKV performance.