CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

TL;DR

CacheFocus tackles long-context limitations in LLMs by introducing a training-free framework that combines offline, query-independent Context KV caching with cache re-positioning, layer-adaptive pruning, and adaptive positional allocation. The method reuses RoPE properties to re-position cached keys, prunes low-relevance caches across layers, and dynamically allocates positional encodings to maximize encoding space usage, enabling efficient long-context generation. Empirical results on Natural Questions and TriviaQA show CacheFocus outperforms PCW, RAG, and APE baselines, maintaining performance even when input lengths exceed 4K tokens and scaling effectively with tens of thousands of tokens in Qwen2-Instruct settings. The approach reduces prefill and decoding latency and provides a scalable path for practical long-text retrieval-augmented generation without additional training.

Abstract

Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive Positional Allocation Strategy dynamically reassigns cache positions to maximize the use of the available positional encoding range. Experiments on the Natural Questions and TriviaQA datasets demonstrate that CacheFocus outperforms alternative methods even when inputs exceed the 4K limit of the LLaMA-2 model, emphasizing its practical effectiveness for long-context LLMs. Moreover, even with the large maximum input length of Qwen2, CacheFocus maintains consistent performance as the number of documents increases, effectively managing long-text generation without degradation.
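
The re-positioning of cached keys mentioned in the abstract relies on a general property of RoPE: rotations compose, so a key that was already rotated for one position can be moved to another by rotating it by the offset alone. The following is a minimal sketch of that property only, not the authors' implementation. The function names `rope_rotate` and `reposition_key`, the interleaved pairing of dimensions, and the base of 10000 are assumptions made for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by a position-dependent angle.

    Assumption: interleaved pairing; some model implementations pair the
    first and second halves of the vector instead.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reposition_key(cached_key, old_pos, new_pos, base=10000.0):
    """Move an already-rotated key from old_pos to new_pos.

    Because rotations add, rotating by the offset (new_pos - old_pos)
    is enough; the un-rotated key is never needed.
    """
    return rope_rotate(cached_key, new_pos - old_pos, base)

raw_key = [0.3, -1.2, 0.8, 0.5]          # hypothetical un-rotated key
cached = rope_rotate(raw_key, 7)         # cached offline at position 7
moved = reposition_key(cached, 7, 42)    # re-positioned at query time
direct = rope_rotate(raw_key, 42)        # fresh encoding at position 42
max_err = max(abs(a - b) for a, b in zip(moved, direct))
```

Here `max_err` stays at floating-point noise, which is why a cache computed offline at one position can be reused at another without recomputing the key projection.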

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: The performance of CacheFocus on NQ for LLaMA-2-7B-Chat. The results indicate that CacheFocus not only extends the usable input length but also performs robustly as the input grows.
  • Figure 1: Pre-filling & Decoding
  • Figure 2: An overall architecture of CacheFocus: 1) Offline Query-Independent Parallel Document Caching: documents are split into fixed-length passages and cached along with a shared prefix; 2) Cache Retrieval: given a query, a retriever fetches the relevant passages and their caches; 3) Pre-filling with Layer-Adaptive Cache Pruning: the pre-computed caches are re-positioned within the model's positional encoding range, and Layer-Adaptive Cache Pruning is applied at specific layers based on accumulated attention scores, allowing the model to select semantically relevant documents; 4) Decoding: after pre-filling, the final caches are re-positioned according to the pruned caches via the Adaptive Positional Allocation Strategy, and the model proceeds with decoding, reducing computational cost while keeping high-quality context.
  • Figure 2: Layer-Adaptive Cache Pruning
  • Figure 3: An illustration of the "align" and "sort" strategies of the Adaptive Positional Allocation Strategy. The "align" strategy simply assigns cache positions to the available slots in the positional encoding space, whereas the "sort" strategy allocates these positions based on attention scores. Note that all strategies place caches at positions close to the query, inspired by Liu et al. (2024, "Lost in the Middle").
  • ...and 2 more figures
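
The pruning and allocation steps described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the paper's algorithm: the helper names `prune_caches` and `allocate_positions`, the per-document score dictionary, the fixed passage length, and the choice to put the highest-scoring cache nearest the query under "sort" are all assumptions. The layer schedule and the way attention scores are accumulated are not given in this excerpt and are not modeled here.

```python
def prune_caches(scores, keep):
    """Keep the `keep` documents with the highest accumulated attention
    score, returned in their original retrieval order."""
    top = set(sorted(scores, key=scores.get, reverse=True)[:keep])
    return [doc for doc in scores if doc in top]

def allocate_positions(doc_ids, scores, doc_len, query_start, strategy="align"):
    """Assign each surviving cache a start position so that the caches
    form one contiguous block ending right before the query."""
    if strategy == "align":
        order = list(doc_ids)  # keep incoming order, just close the gaps
    elif strategy == "sort":
        # Assumption: ascending score, so the best cache sits next to the query.
        order = sorted(doc_ids, key=lambda d: scores[d])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    first = query_start - doc_len * len(order)
    if first < 0:
        raise ValueError("caches do not fit before the query position")
    return {doc: first + i * doc_len for i, doc in enumerate(order)}

# Hypothetical accumulated attention scores for four retrieved passages.
scores = {"d0": 0.10, "d1": 0.45, "d2": 0.05, "d3": 0.40}
kept = prune_caches(scores, keep=2)
aligned = allocate_positions(kept, scores, doc_len=100, query_start=1000)
ranked = allocate_positions(kept, scores, doc_len=100, query_start=1000,
                            strategy="sort")
```

In this toy run both strategies fill the 200 positions directly before the query; they differ only in which surviving passage ends up adjacent to it.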