LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences

Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang

TL;DR

LouisKV addresses the memory and efficiency bottlenecks of KV cache retrieval in long-context LLMs by introducing semantic-aware adaptive retrieval and decoupled fine-grained KV management. It exploits the strong temporal locality of critical KVs to trigger retrieval only at semantic boundaries, and it uses clustering for input KVs and temporal segmentation for output KVs, complemented by kernel-level optimizations. Across long-input and long-output tasks, LouisKV achieves up to 4.7× speedup over state-of-the-art KV retrieval methods while keeping accuracy near that of the full-cache baseline. The approach enables efficient KV caching across diverse long-sequence tasks and has practical implications for scalable deployment of reasoning-heavy LLMs.

Abstract

While Key-Value (KV) cache succeeds in reducing redundant computations in auto-regressive models, it introduces significant memory overhead, limiting its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks due to per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, efficiently handling such scenarios has become increasingly important. To address this issue, we present two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs exhibit distinct distribution patterns across the input prompt and generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy leveraging temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also designs a decoupled, fine-grained management scheme that tailors differentiated strategies for input and output sequences to create retrieval units that better match the model's attention patterns, enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels to accelerate the KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.
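The semantic-aware retrieval idea in the abstract can be illustrated with a minimal sketch. It assumes that the similarity is the cosine similarity between the current query vector and the query recorded at the most recent retrieval, and that retrieval fires when this value falls below a threshold `tau`. The class name `RetrievalTrigger`, the choice of cosine similarity, and the reference point for the comparison are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class RetrievalTrigger:
    """Hypothetical helper: decides when critical KVs should be re-fetched."""

    def __init__(self, tau: float) -> None:
        self.tau = tau
        # Query vector seen at the last retrieval (assumed reference point).
        self.anchor: Optional[List[float]] = None

    def should_retrieve(self, query: Sequence[float]) -> bool:
        # The first decoding step always retrieves, since nothing is cached yet.
        if self.anchor is None:
            self.anchor = list(query)
            return True
        r = cosine_similarity(query, self.anchor)
        # A drop below tau is treated as a semantic boundary.
        if r < self.tau:
            self.anchor = list(query)
            return True
        # Otherwise the previously retrieved KVs are reused.
        return False


trigger = RetrievalTrigger(tau=0.8)
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
decisions = [trigger.should_retrieve(q) for q in queries]
print(decisions)
```

In this toy run, the second and fourth queries stay close to the query at the preceding retrieval, so only two of the four decoding steps would pay the transfer cost.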

Paper Structure

This paper contains 26 sections, 3 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of Arkvale and LouisKV. Arkvale adopts page-level KV management and retrieves critical pages from the CPU for every decoding token, leading to high transfer overhead and potential accuracy degradation. In contrast, LouisKV reuses critical KVs by exploiting temporal locality to significantly reduce retrieval frequency. It also employs a decoupled KV management scheme, enabling precise retrieval to improve transfer efficiency while maintaining high accuracy.
  • Figure 2: Accuracy and efficiency comparison on various long-sequence tasks. (a) LouisKV achieves accuracy comparable to FullCache (the lossless baseline) and superior to state-of-the-art retrieval methods. (b) LouisKV significantly reduces inference latency, substantially outperforming state-of-the-art retrieval methods, while also avoiding the Out-Of-Memory errors that FullCache faces in long-sequence and large batch size scenarios.
  • Figure 3: Access patterns of critical KVs in different long-sequence tasks. First, both tasks demonstrate temporal locality: during the generation of a coherent segment (Current Segment), the similarity of critical KV sets maintains high values. Second, they reveal a distinct spatial distribution: (a) in a long-document task, attention is sparsely distributed in the prompt, whereas (b) in a mathematical reasoning task, attention is densely focused on some intermediate steps in the previous output.
  • Figure 4: The design of LouisKV. (a) Retrieval is triggered when the query similarity $r$ drops below a threshold $\tau$, loading critical KV entries from the CPU cache pool. (b) During prefilling, K-means clustering is employed to group semantically similar KVs into clusters. During decoding, consecutively generated KVs are partitioned into temporal segments. These clusters and segments are then offloaded to a unified cache pool on the CPU. The detailed algorithm is provided in the appendix.
  • Figure 5: Performance comparison of LouisKV against four baseline methods across various cache budgets. Subplots (a-f) present the results on six long-input understanding tasks using Llama-3.1-8B-Instruct. Subplots (g-l) display the results on six long-output reasoning tasks using Qwen3-8B. For detailed experimental results on more models, please refer to the appendix.
  • ...and 7 more figures
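
The decoupled management scheme described in the Figure 4 caption (K-means clusters for input KVs, temporal segments for output KVs, one unified pool of retrieval units) can be sketched in plain Python. Several details here are assumptions for illustration only: the deterministic centroid initialization, the fixed segment length `seg_len`, the scoring of a unit by the dot product between the query and the unit's mean key, and the greedy token budget. The paper's own implementation uses custom Triton and CUDA kernels, and its segmentation and scoring rules may differ.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_clusters(keys: Sequence[Vector], k: int, iters: int = 10) -> List[List[int]]:
    """Group input-token keys into at most k clusters of token indices."""
    # Assumption: the first k keys seed the centroids (deterministic for the demo).
    centroids = [list(keys[i]) for i in range(k)]
    groups: List[List[int]] = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, key in enumerate(keys):
            nearest = min(range(k), key=lambda c: sq_dist(key, centroids[c]))
            groups[nearest].append(idx)
        # Empty clusters keep their previous centroid.
        centroids = [
            mean([keys[i] for i in g]) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def temporal_segments(first_index: int, num_tokens: int, seg_len: int) -> List[List[int]]:
    """Split consecutively generated tokens into fixed-length runs (assumed rule)."""
    end = first_index + num_tokens
    return [list(range(s, min(s + seg_len, end))) for s in range(first_index, end, seg_len)]


def select_units(query: Vector, keys: Sequence[Vector],
                 units: Sequence[List[int]], budget: int) -> List[List[int]]:
    """Greedily keep the highest-scoring units that fit in a token budget."""
    def score(unit: List[int]) -> float:
        rep = mean([keys[i] for i in unit])  # unit representative: mean key
        return sum(q * r for q, r in zip(query, rep))

    chosen: List[List[int]] = []
    used = 0
    for unit in sorted(units, key=score, reverse=True):
        if used + len(unit) <= budget:
            chosen.append(unit)
            used += len(unit)
    return chosen


input_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
output_keys = [[0.5, 0.5], [0.5, 0.5], [-1.0, 0.0], [-1.0, 0.0]]
all_keys = input_keys + output_keys

clusters = kmeans_clusters(input_keys, k=2)
segments = temporal_segments(first_index=len(input_keys), num_tokens=len(output_keys), seg_len=2)
chosen = select_units([1.0, 0.0], all_keys, clusters + segments, budget=4)
print(clusters, segments, chosen)
```

The point of the sketch is the asymmetry: input tokens are grouped by key similarity regardless of position, while output tokens are grouped by generation order, and both kinds of unit compete in the same pool at retrieval time.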