LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Wenbo Wu, Qingyi Si, Xiurui Pan, Ye Wang, Jie Zhang
TL;DR
LouisKV addresses the memory and efficiency bottlenecks of KV cache retrieval in long-context LLMs by introducing semantic-aware adaptive retrieval and decoupled fine-grained KV management. It exploits the strong temporal locality of critical KVs to trigger retrieval only at semantic boundaries, and it manages input KVs by clustering and output KVs by temporal segmentation, complemented by kernel-level optimizations. Across long-input and long-output tasks, LouisKV achieves up to 4.7x end-to-end latency speedup over existing KV retrieval methods while keeping accuracy near-lossless relative to the full KV cache. This makes KV caching efficient across diverse long-sequence tasks, with practical implications for scalable deployment of reasoning-heavy LLMs.
Abstract
While the Key-Value (KV) cache reduces redundant computation in auto-regressive models, it introduces significant memory overhead, which limits its practical deployment in long-sequence scenarios. Existing KV retrieval methods mitigate this by dynamically retaining only a subset of KV entries on the GPU. However, they still suffer from notable efficiency and accuracy bottlenecks caused by per-token retrieval and coarse-grained page-level KV management, especially in long-output reasoning scenarios. With the emergence of large reasoning models, handling such scenarios efficiently has become increasingly important. To address this issue, we first make two key observations: (1) critical KVs exhibit strong temporal locality during decoding, and (2) these KVs follow distinct distribution patterns in the input prompt and in the generated output. Building on these observations, we propose LouisKV, an efficient KV cache retrieval framework designed for various long-sequence scenarios. Specifically, LouisKV introduces a semantic-aware retrieval strategy that leverages temporal locality to trigger retrieval only at semantic boundaries, drastically reducing computation and data transfer overhead. LouisKV also adopts a decoupled, fine-grained management scheme that applies different strategies to input and output sequences, creating retrieval units that better match the model's attention patterns and enabling precise identification of critical KVs. Furthermore, to boost efficiency, LouisKV incorporates several kernel-level optimizations, including custom Triton and CUDA kernels that accelerate KV clustering and retrieval. Evaluations show that LouisKV achieves up to 4.7$\times$ speedup over state-of-the-art KV retrieval methods while maintaining near-lossless accuracy across diverse long-sequence tasks, including long-input short-output, short-input long-output, and long-input long-output scenarios.
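To make the idea of boundary-triggered retrieval concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how semantic boundaries are detected, so this sketch assumes a boundary occurs when the cosine similarity between the current decoding query and the query at the last retrieval drops below a threshold. The function names `cosine` and `retrieval_steps` and the `threshold` parameter are hypothetical, introduced only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieval_steps(queries, threshold=0.8):
    """Return the decoding steps at which a KV retrieval would be triggered.

    Assumption (not from the paper): a semantic boundary is declared when the
    current query's similarity to the query at the last retrieval falls below
    `threshold`. Between boundaries, the previously retrieved KVs are reused,
    which is where temporal locality saves computation and data transfer.
    """
    if not queries:
        return []
    steps = [0]          # always retrieve at the first decoding step
    anchor = queries[0]  # query vector at the most recent retrieval
    for t in range(1, len(queries)):
        if cosine(queries[t], anchor) < threshold:
            steps.append(t)
            anchor = queries[t]
    return steps


# Toy 2-D queries: three similar steps, then a shift in direction.
toy_queries = [[1.0, 0.0], [0.99, 0.1], [0.95, 0.2], [0.0, 1.0], [0.1, 0.99]]
print(retrieval_steps(toy_queries))  # [0, 3]
```

In this toy run, retrieval happens at 2 of 5 steps instead of at every step, which is the kind of saving that per-token retrieval cannot offer. A real system would operate on per-head query tensors on the GPU and would pair this trigger with the clustering and segmentation schemes described above.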
