Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti

TL;DR

This work identifies fundamental limitations of both retrieval-based and full-context approaches to knowledge-intensive reasoning in LLMs. It introduces task-aware, query-agnostic KV cache compression: a global knowledge cache is compressed offline, conditioned on a task description, and then reused across queries, enabling fast reasoning over large corpora. Empirical results show the approach outperforms RAG on broad, multi-hop questions and on long-context benchmarks (e.g., LongBench v2), while reducing inference latency (e.g., from 0.43s to 0.16s at 30x compression). The method offers a scalable, efficient alternative to full-context processing and retrieval, and could be combined with RAG in a hybrid system that handles both broad and narrow queries in practical deployments.

Abstract

Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.

Paper Structure

This paper contains 24 sections, 13 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Overlap match of the words between ground truth and predictions of various KV cache compression methods compared to RAG on a synthetic corpus with 32k tokens. Our Few-Shot compression approach achieves results exceeding RAG when the context length is much smaller than the corpus size.
  • Figure 2: An illustration of our compression strategy that reduces the original context (C) from a KV cache of 128k tokens to 16k. This process is guided by task instructions (T) and few-shot examples (FS), condensing the essential information needed for factual QA on the corpus documents. At inference time, the LLM can answer multiple questions as if it had access to the entire (uncompressed) corpus.
  • Figure 3: We examine how the model attends to context tokens when conditioned on the last token, a task description, a description with few-shot examples, and a description with both few-shot examples and a question. As we increase the information in the prompt, the cross attention between the prompt and the context better discriminates the context tokens that are relevant for decoding the answer. The perplexity is calculated on the loss for the answer.
  • Figure 4: Overview of our synthetic dataset. In this example, the connectivity level is set to 2.
  • Figure 5: Overview of our questions. In this example, the connectivity level is set to 2.
  • ...and 7 more figures
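
The captions for Figures 2 and 3 describe compression guided by how strongly a task prompt attends to each context token. The sketch below is one plausible reading of that idea, not the authors' implementation: it scores every context token by the mean attention it receives from the prompt tokens and keeps the top-scoring key-value pairs. The function name `compress_kv`, the single-head setup, the mean-pooling of attention, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_kv(keys, values, prompt_queries, keep):
    """Keep the `keep` context tokens that the prompt attends to most.

    keys, values:    (n_ctx, d) cached context keys and values (one head).
    prompt_queries:  (n_prompt, d) queries of the task-description tokens.
    Returns the compressed keys, compressed values, and the kept indices.
    Hypothetical helper: names and scoring rule are illustrative only.
    """
    d = keys.shape[-1]
    # Attention of every prompt token over every context token.
    attn = softmax(prompt_queries @ keys.T / np.sqrt(d), axis=-1)
    # One relevance score per context token (assumed: mean over prompt).
    scores = attn.mean(axis=0)
    # Indices of the top-`keep` tokens, restored to original order.
    kept = np.sort(np.argsort(scores)[-keep:])
    return keys[kept], values[kept], kept


# Toy usage: 64 context tokens reduced to 8, mirroring the 8x ratio
# of the 128k -> 16k example in Figure 2.
rng = np.random.default_rng(0)
ctx_keys = rng.normal(size=(64, 8))
ctx_values = rng.normal(size=(64, 8))
task_queries = rng.normal(size=(4, 8))
small_k, small_v, kept_idx = compress_kv(ctx_keys, ctx_values, task_queries, keep=8)
print(small_k.shape, kept_idx)
```

Because the scores depend only on the task prompt and not on any individual question, the compressed cache in this sketch can be computed once offline and reused across queries, which is the property the TL;DR calls query-agnostic.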