You Only Use Reactive Attention Slice For Long Context Retrieval

Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao

TL;DR

The paper tackles the challenge of extending LLM context windows without fine-tuning by introducing YOURA, an attention-based retrieval method that ranks sentences using a reaction score derived from changes in the model's attention distribution. Its key ideas are the reaction vector, whose per-sentence mean gives the reaction score used to retrieve the most relevant sentences, and Embedding-Agnostic Sentence Yield (EASY), a best-effort algorithm that maps sentences to token indices across different tokenizers. Evaluations on six LongBench QA datasets with three open-source models show that YOURA improves vLLM inference throughput by up to 30% by retrieving fewer, more relevant tokens, while keeping the quality score nearly identical to the truncate-middle baseline. The approach is fully fine-tuning-free and integrates with off-the-shelf LLMs, offering a practical path to scalable long-context QA and broader applicability to retrieval-augmented generation tasks.

Abstract

Supporting longer context for Large Language Models (LLMs) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context to the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called the reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLMs across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.

Paper Structure

This paper contains 32 sections, 7 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: YOURA improves the retrieval quality and inference throughput by retrieving only the "reactive" sentences.
  • Figure 2: Comparison of a typical embedding-based retriever (cosine similarity), a truncation-based approach, and our attention-based retriever (reaction score). In our evaluation, we show that the reaction score is a better retrieval heuristic in terms of quality and performance. Unlike the embedding-based approach, which splits the raw string context into chunks (e.g., sentences) before retrieval, YOURA splits the token-indexed vector, the reaction vector in (c), into chunks. Such an approach requires mapping the raw sentences to token indices. We propose the Embedding-Agnostic Sentence Yield (EASY) algorithm to address this challenge.
  • Figure 3: Overview of YOURA and where it is used in Retrieval Augmented Generation (RAG), with an example (example context: "I am an amazing researcher. I like LLM.", example query: "Who am I?"). The first step is calculating the reaction vector, the absolute difference between the attention vector with and without the query (left side of the figure). The highlighted cells in the attention matrix indicate that the token pair exhibits a relatively high value (e.g., Who vs. I). Once the reaction vector has been calculated, each sentence is assigned a reaction score, the mean of the corresponding reaction vector slice. To map each sentence to a token sequence, we propose the Embedding-Agnostic Sentence Yield (EASY) algorithm (see the paper's EASY section). The retriever passes the sentences with high reaction scores on to the augmentor. The pre-trained LLM generates answers using the augmented text, which includes the task-specific prompt, the retrieved context, and the question.
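
The scoring steps described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the two per-token attention vectors (with and without the query) have already been extracted from the model, and that the sentence-to-token spans have already been produced by a mapping step such as EASY; how either is obtained is not covered here. The function names, the top_k cutoff, the choice to return sentences in their original order, and all attention values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def reaction_vector(attn_without_query, attn_with_query):
    """Per-token absolute difference between the two attention vectors."""
    if len(attn_without_query) != len(attn_with_query):
        raise ValueError("both vectors must cover the same context tokens")
    return [abs(w - wo) for wo, w in zip(attn_without_query, attn_with_query)]


def reaction_scores(reaction, sentence_spans):
    """Mean of the reaction-vector slice for each (start, end) token span.

    Spans are half-open and assumed non-empty; in this sketch they stand in
    for the output of a sentence-to-token mapping step.
    """
    return [mean(reaction[start:end]) for start, end in sentence_spans]


def retrieve(sentences, scores, top_k):
    """Greedily keep the top_k highest-scoring sentences.

    Assumption: a fixed top_k cutoff and original-order output; the paper
    may use a different stopping rule.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


# Toy data for the example context in Figure 3 (one token per word or period).
sentences = ["I am an amazing researcher.", "I like LLM."]
spans = [(0, 6), (6, 10)]  # hypothetical token spans for the two sentences

# Made-up attention values, not measurements from any model.
attn_without = [0.10] * 10
attn_with = [0.40, 0.20, 0.10, 0.15, 0.30, 0.10, 0.20, 0.10, 0.15, 0.10]

reaction = reaction_vector(attn_without, attn_with)
scores = reaction_scores(reaction, spans)
selected = retrieve(sentences, scores, top_k=1)
print(selected)
```

With these toy values the first sentence has the larger mean reaction, so it is the one passed on to the augmentor. Averaging over the slice, rather than summing, keeps long sentences from being favored simply because they contain more tokens.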