Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
TL;DR
This work identifies fundamental limitations of both retrieval-based and full-context approaches to knowledge-intensive reasoning in LLMs. It introduces task-aware, query-agnostic KV-cache compression: a global knowledge cache is compressed offline, conditioned on a task description, and then reused across queries for fast reasoning over large corpora. Empirically, the approach outperforms RAG on broad, multi-hop questions and on long-context benchmarks (e.g., LongBench v2), and it can substantially reduce latency (e.g., from 0.43s to 0.16s at 30x compression). The method offers a scalable, efficient alternative to full-context processing and retrieval, and it could be combined with RAG in a hybrid system that handles both broad and narrow queries in practical deployments.
Abstract
Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
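
To make the idea concrete, the following is a minimal toy sketch of what task-aware KV cache compression could look like. It is not the authors' implementation, and the abstract does not specify the scoring rule. The sketch assumes that each cached key/value pair is scored by the total attention it receives from the query vectors of a task description, and that only the highest-scoring pairs are kept. The names `compress_kv`, `task_queries`, and `keep` are hypothetical, and the vectors are made-up toy data rather than real model activations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_kv(keys, values, task_queries, keep):
    """Keep the `keep` KV pairs that receive the most attention
    from the task-description queries (assumed scoring rule)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in task_queries:
        # Scaled dot-product attention of one task query over all cached keys.
        attn = softmax([dot(q, k) * scale for k in keys])
        for i, a in enumerate(attn):
            scores[i] += a
    # Pick the top-scoring positions, then restore their original order.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache: six 2-dimensional keys, each with a 1-dimensional value.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]
values = [[float(i)] for i in range(len(keys))]

# Toy query vectors standing in for the encoded task description.
task_queries = [[2.0, 0.0], [1.5, 0.5]]

small_keys, small_values, kept = compress_kv(keys, values, task_queries, keep=2)
compression_rate = len(keys) / len(small_keys)
print(kept, compression_rate)
```

Because the scores depend only on the task description and not on any individual user query, the compressed cache in this sketch can be computed once offline and reused for every query on the same task, which is the query-agnostic property the TL;DR describes.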
