ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo

TL;DR

This work tackles the memory and latency challenges of KV cache in long-context LLM inference by introducing ClusterKV, a recallable KV compression method that operates at semantic-cluster granularity. By clustering key vectors in semantic space and selecting clusters based on attention to the current query, ClusterKV reduces recall overhead while preserving accuracy. The system design couples GPU-based clustering with CPU-backed KV storage and a cluster-level GPU cache, employing batched kernels and asynchronous execution to minimize overhead. Empirical results on 32k context windows show near full KV accuracy at 1k–2k budgets, with up to 2x latency improvements and 2.5x decoding throughput gains compared to state-of-the-art recallable methods.

Abstract

Large Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2$\times$ speedup in latency and a 2.5$\times$ improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency. Our code is available at https://github.com/sjtu-zhao-lab/ClusterKV.
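To make the idea of recalling tokens at semantic-cluster granularity concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes plain k-means with Euclidean distance over key vectors, scores each cluster by the dot product between its centroid and the current query, and takes whole clusters in descending score order until a token budget is filled. The function names `kmeans` and `select_tokens`, the distance metric, the iteration count, and the truncation of the last cluster are all illustrative assumptions.

```python
import numpy as np


def kmeans(keys, n_clusters, n_iters=10, seed=0):
    """Plain Lloyd's k-means over key vectors (illustrative assumption,
    not the clustering kernel described in the paper)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct keys.
    init = rng.choice(len(keys), size=n_clusters, replace=False)
    centroids = keys[init].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every key to every centroid.
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)
    # Final assignment so that labels match the returned centroids.
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def select_tokens(query, centroids, labels, budget):
    """Pick clusters by query-centroid score until `budget` tokens are chosen.

    The last cluster is truncated to fit the budget (an assumption made
    here to keep the selection size fixed)."""
    scores = centroids @ query
    selected = []
    for c in np.argsort(-scores):  # highest-scoring cluster first
        members = np.flatnonzero(labels == c)
        room = budget - len(selected)
        selected.extend(members[:room].tolist())
        if len(selected) >= budget:
            break
    return np.array(sorted(selected))
```

In this sketch, attention for the current decoding step would then be computed only over the keys and values at the returned indices; a different query can select a different set of clusters, which is what makes the compression recallable rather than a permanent eviction.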

Paper Structure

This paper contains 21 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Comparison of KV compression methods. Green boxes represent tokens selected for attention computation.
  • Figure 2: Semantic space and attention weights of tokens in the last step of Fig. 1d. Lighter boxes indicate larger weights.
  • Figure 3: (a) Variation in token importance across decoding steps with a context length of 8192. (b) Internal fragmentation of important tokens at the granularity of pages ($page\_size=16$).
  • Figure 4: Clustering and selection process. Green dots represent key vectors, and purple dots represent centroids of clusters.
  • Figure 5: System overview of ClusterKV. The green box represents components running on the GPU.
  • ...and 8 more figures
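
The internal fragmentation named in the Figure 3(b) caption can be illustrated with a small calculation. The sketch below is an assumption-laden example, not code from the paper: given the positions of important tokens, it counts how many fixed-size pages must be recalled to cover them and what fraction of the loaded tokens is actually important. The function name `page_fragmentation` is hypothetical, and every page is assumed to be full.

```python
import numpy as np


def page_fragmentation(important, page_size=16):
    """Return (pages_touched, tokens_loaded, useful_fraction) when important
    tokens are recalled at page granularity (pages split by position)."""
    important = np.unique(np.asarray(important, dtype=int))
    if important.size == 0:
        return 0, 0, 0.0
    # A token at position p lives in page p // page_size.
    pages = np.unique(important // page_size)
    tokens_loaded = len(pages) * page_size  # assumes every page is full
    return len(pages), tokens_loaded, len(important) / tokens_loaded
```

For example, important tokens at positions 3, 40, 41 and 100 fall into three different pages of size 16, so 48 tokens are loaded to serve 4 important ones.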