CacheClip: Accelerating RAG with Effective KV Cache Reuse

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

TL;DR

CacheClip targets the time-to-first-token (TTFT) bottleneck in Retrieval-Augmented Generation (RAG) by combining three techniques: a lightweight auxiliary LLM that selects important tokens, shared prefixes, and a grouping strategy for KV cache updates. It keeps the precomputed per-chunk KV caches and selectively recomputes a small, context-aware set of tokens to restore inter-chunk attention and cross-document reasoning. CacheClip retains up to 94.8% of full-attention performance on NIAH and 85.0% on LongBench, speeds up prefill by up to 1.92×, and outperforms prior methods such as APE and CacheBlend. The design relies on CPU-based auxiliary processing, position-ID rearrangement, and careful token mapping to maintain contextual coherence, which makes it practical for production RAG systems with long contexts.

Abstract

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes, which rarely occur in RAG scenarios, while direct precomputation sacrifices quality because of missing inter-chunk attention and repeated attention sinks. Recent methods such as APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit last-layer attention distributions similar to those of primary LLMs (the target model for generation). This enables efficient identification of the tokens critical for restoring inter-chunk attention, which significantly improves response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is fine-tuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92× in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.
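
To make the interplay between token selection and grouping more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-token importance scores are already available (for instance, attention mass taken from an auxiliary model), splits the context into fixed-size groups of adjacent tokens, and marks whole groups for recomputation until a recompute budget is reached. The function name `select_tokens_to_recompute`, the parameters `recompute_ratio` and `group_size`, and the choice of the mean score as the group ranking rule are all hypothetical illustrations; the abstract does not specify them.

```python
from typing import List


def select_tokens_to_recompute(
    scores: List[float],
    recompute_ratio: float = 0.2,
    group_size: int = 4,
) -> List[int]:
    """Pick whole groups of adjacent tokens with the highest mean score.

    `scores` is a stand-in for per-token importance (assumed input);
    how CacheClip actually derives and ranks it is not modeled here.
    """
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be >= 1")

    n = len(scores)
    budget = int(n * recompute_ratio)  # maximum number of tokens to recompute

    # Split token positions into consecutive groups (the last may be shorter).
    groups = [
        list(range(start, min(start + group_size, n)))
        for start in range(0, n, group_size)
    ]

    # Rank groups by mean score, highest first; ties go to the earlier group.
    ranked = sorted(
        groups,
        key=lambda g: (-sum(scores[i] for i in g) / len(g), g[0]),
    )

    selected: List[int] = []
    for group in ranked:
        # Stop at the first group that would exceed the budget.
        if len(selected) + len(group) > budget:
            break
        selected.extend(group)

    # Return positions in document order.
    return sorted(selected)


if __name__ == "__main__":
    demo_scores = [0.1, 0.1, 0.9, 0.8, 0.0, 0.0, 0.2, 0.7, 0.05, 0.05]
    print(select_tokens_to_recompute(demo_scores, recompute_ratio=0.5, group_size=2))
```

Selecting whole groups rather than isolated tokens mirrors the stated goal of maintaining local coherence during partial KV cache updates: every recomputed token has its immediate neighbors recomputed as well. All other positions would keep their precomputed KV cache entries.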

Paper Structure

This paper contains 32 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of CacheClip.
  • Figure 2: Comparison of Attention Similarity between Qwen2.5-0.5B and Qwen2.5-7B on 500 Sequences per Input Length from 2WikiMultihopQA (Ho et al., 2020).
  • Figure 3: Workflow of CacheClip.
  • Figure 4: Performance Comparison across RULER Test Cases (Qwen2.5-14B, Input Length = 8192).
  • Figure 5: Impact of Recompute Ratio on RULER Performance (Input Length = 8192).
  • ...and 2 more figures