AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu
TL;DR
AttentionRAG addresses the problem of excessive and redundant retrieved contexts in Retrieval-Augmented Generation by introducing an attention-guided context-pruning method that reformulates queries into a next-token-prediction template to focus attention on a single anchor token. The approach processes long contexts in chunks, computes chunk-level attention features, and selects sentences containing the most-attended tokens to form a compressed context, enabling high efficiency without requiring extra training. Empirical results on LongBench and Babilong show up to 6.3x compression with equal or improved task performance compared to uncompressed contexts, and competitive results versus state-of-the-art prompt-compression methods. The method demonstrates strong transferability across models and tasks, with practical speedups via batching and quantization, and offers a flexible framework for scalable, long-context RAG in real-world applications.
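The sentence-selection step described above can be illustrated with a minimal sketch. It assumes the per-token attention scores from the anchor token have already been computed for one chunk and are supplied as plain floats. The function name `select_sentences`, the `top_k` parameter, and the toy scores are all hypothetical and are not taken from the paper, whose exact selection rule may differ.

```python
from typing import List, Tuple

# A sentence is a list of (token, attention_score) pairs.
# The scores are assumed inputs; computing them from a model is out of scope here.
Sentence = List[Tuple[str, float]]


def select_sentences(sentences: List[Sentence], top_k: int) -> List[str]:
    """Keep every sentence that contains at least one of the top_k most-attended tokens."""
    # Flatten to (score, sentence_index) so tokens can be ranked across the whole chunk.
    scored = [
        (score, s_idx)
        for s_idx, sentence in enumerate(sentences)
        for _, score in sentence
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    keep = {s_idx for _, s_idx in top}
    # Emit the kept sentences in their original order to preserve readability.
    return [" ".join(tok for tok, _ in sentences[i]) for i in sorted(keep)]


# Toy chunk with made-up attention scores (illustrative only).
example = [
    [("Paris", 0.40), ("is", 0.02), ("the", 0.01), ("capital", 0.30)],
    [("It", 0.01), ("rained", 0.03), ("today", 0.02)],
    [("France", 0.15), ("borders", 0.02), ("Spain", 0.04)],
]

compressed = select_sentences(example, top_k=3)
```

In this toy input, the three highest-scoring tokens fall in the first and third sentences, so the second sentence is pruned. Raising or lowering `top_k` is one simple way to trade compression against recall in a sketch like this.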
Abstract
While retrieval-augmented generation (RAG) demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. Its core is an attention focus mechanism that reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on the LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10\% in key metrics.
