CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing

Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang, Huan Li, Wei Chu, Xu Yinghui, Yuan Qi, Gang Chen

TL;DR

CTkvr tackles the memory and latency bottlenecks of KV caches in long-context LLM inference by introducing a centroid-then-token KV retrieval pipeline. It builds a lightweight query-centroid index (qcIVF) during prefilling, then refines to top-K keys at the token level, while offloading most KV operations to CPU DRAM and overlapping CPU-GPU execution. The approach yields near FullKV accuracy (≤1% degradation) and substantial throughput gains (3×–4×) across Llama-3-8B and Yi-9B at 96K context, with strong scalability to ultra-long contexts. Extensive experiments, ablations, and comparisons show CTkvr outperforms eviction, block-level, and other token-level methods, while offering vastly faster index construction than Faiss ANN methods and robust compatibility with efficient prefilling techniques.

Abstract

Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from the Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTkvr, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTkvr leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTkvr employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for index construction and search using CPU-GPU co-execution. Experimentally, CTkvr achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTkvr delivers 3× and 4× throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
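The two-stage idea described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the paper's implementation: the choice of centroids (here simply the last few prefill queries), the function names `build_index` and `retrieve`, and the parameters `num_centroids`, `list_size`, `num_probe`, and `top_k` are all assumptions made for illustration. The sketch only shows the general shape of a centroid-level lookup followed by token-level reranking.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def top_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def build_index(prefill_queries, keys, num_centroids, list_size):
    """Prefill stage (assumed design): treat the last `num_centroids` prompt
    queries as centroids and store, for each one, the ids of its
    `list_size` highest-scoring keys."""
    centroids = prefill_queries[-num_centroids:]
    lists = [top_indices([dot(c, k) for k in keys], list_size) for c in centroids]
    return centroids, lists


def retrieve(query, keys, centroids, lists, num_probe, top_k):
    """Decode stage: centroid-level lookup, then token-level refinement."""
    # Stage 1: find the centroids most similar to the new query.
    nearest = top_indices([cosine(query, c) for c in centroids], num_probe)
    # Merge the key ids attached to those centroids into one candidate set.
    candidates = sorted({i for c in nearest for i in lists[c]})
    # Stage 2: rerank only the candidates by exact query-key score.
    scores = [dot(query, keys[i]) for i in candidates]
    return [candidates[j] for j in top_indices(scores, top_k)]


if __name__ == "__main__":
    keys = [[1, 0], [0, 1], [1, 1], [-1, 0], [0.5, 0.2]]
    centroids, lists = build_index([[1, 0], [0, 1]], keys, 2, 3)
    print(retrieve([0.9, 0.1], keys, centroids, lists, num_probe=1, top_k=2))
```

In this toy setup the decode-time query scores only the three keys attached to its nearest centroid rather than all five, which is the source of the efficiency gain; whether the reduced candidate set still contains the true top-k keys depends on how similar the query is to its centroid, which is the observation the abstract relies on.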

Paper Structure

This paper contains 41 sections, 3 theorems, 1 equation, 14 figures, 14 tables, 2 algorithms.

Key Result

Lemma 1

Let $A = [a_1, a_2, \ldots, a_n]$ be an array of length $n$, and let $B = [b_1, b_2, \ldots, b_n]$ be a rearrangement of $A$ such that the number of inversions $t = \left| \{(i, j) \mid i < j \text{ and } a_i > a_j \text{ and } b_i < b_j\} \right|$; then, for any $1 \leq m \leq n$, the set $S_m = \{

Figures (14)

  • Figure 1: Illustration of KV cache compression methods and their comparison based on accuracy drop and throughput.
  • Figure 2: Analysis of query vector similarity: (a) Query distance proximity correlates positively with cosine similarity, with consistent trends across datasets. (b) The overlap of top-$k$ retrieved Keys is approximately positively correlated with query cosine similarity.
  • Figure 3: The main components of CTkvr. Pseudocode for (a) and for (b), (c), and (d) is provided in the prefilling and decoding algorithms (alg:DIDX-prefill and alg:DIDX-decoding), respectively.
  • Figure 4: CTkvr optimizes long-context decoding throughput by offloading most of the KV cache to the CPU during prefilling and enabling GPU-CPU co-execution for efficient attention computation.
  • Figure 5: Heatmap comparison of FullKV and CTkvr performance, following the Needle-in-a-Haystack methodology.
  • ...and 9 more figures

Theorems & Definitions (3)

  • Lemma 1
  • Lemma 2
  • Theorem 1