You Only Use Reactive Attention Slice For Long Context Retrieval
Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao
TL;DR
The paper tackles the challenge of extending LLM context windows without fine-tuning by introducing YOURA, an attention-based retrieval method that ranks sentences using a reaction score derived from changes in the model's attention distribution. A key innovation is the Reaction Vector and its averaging into a Reaction Score to selectively retrieve the most informative sentences, combined with Embedding-Agnostic Sentence Yield (EASY) to robustly map sentences to token indices across diverse tokenizers. Evaluations on LongBench with multiple open-source models show YOURA improves QA quality and substantially increases inference throughput (up to ~30%) by retrieving fewer, more relevant tokens, while maintaining comparable answer accuracy to truncation strategies. The approach is fully fine-tuning-free and integrates with off-the-shelf LLMs, offering a practical path to scalable long-context QA and broader applicability to retrieval-augmented generation tasks.
Abstract
Supporting longer context for Large Language Models (LLM) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieves the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLM models across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
