Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval

Yuwei Zhang, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

TL;DR

Attrieval is a training-free algorithm that uses attention weights to retrieve relevant facts from a long context and incorporate them into the reasoning process. It notably improves long-context reasoning on both synthetic and real-world QA datasets across various models.

Abstract

Large Language Models (LLMs) often exhibit substantially shorter effective context lengths than their claimed capacities, especially when handling complex reasoning tasks that require integrating information from multiple parts of a long context and performing multi-step reasoning. Although Chain-of-Thought (CoT) prompting has shown promise in reducing task complexity, our empirical analysis reveals that it does not fully resolve this limitation. Through controlled experiments, we identify poor recall of implicit facts as the primary cause of failure, which significantly hampers reasoning performance. Interestingly, we observe that the internal attention weights from the generated CoT tokens can effectively ground implicit facts, even when these facts are not explicitly recalled. Building on this insight, we propose a novel training-free algorithm, Attrieval, which leverages attention weights to retrieve relevant facts from the long context and incorporates them into the reasoning process. Additionally, we find that selecting context tokens from CoT tokens further improves performance. Our results demonstrate that Attrieval enhances long-context reasoning capability notably on both synthetic and real-world QA datasets with various models.
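
The core idea in the abstract, using attention from generated CoT tokens to locate facts in the context and then feeding those facts back into the prompt, can be illustrated with a minimal sketch. This is not the paper's exact procedure: the mean aggregation over CoT tokens, the per-fact length normalization, and the names `rank_facts_by_attention` and `build_augmented_prompt` are all assumptions made for illustration. The attention matrix is assumed to be already averaged over heads and layers.

```python
from typing import List, Tuple


def rank_facts_by_attention(
    attn: List[List[float]],
    fact_spans: List[Tuple[int, int]],
    top_k: int = 2,
) -> List[int]:
    """Return indices of the top_k facts most attended by CoT tokens.

    attn[i][j] is the attention weight from generated CoT token i to
    context token j (assumed pre-averaged over heads and layers).
    fact_spans[f] is the half-open token range (start, end) of fact f.
    """
    n_ctx = len(attn[0])
    # Average attention each context token receives across all CoT tokens.
    token_scores = [
        sum(row[j] for row in attn) / len(attn) for j in range(n_ctx)
    ]
    # Score a fact by the mean score of its tokens (hypothetical choice).
    fact_scores = [
        sum(token_scores[s:e]) / (e - s) for s, e in fact_spans
    ]
    ranked = sorted(
        range(len(fact_spans)), key=lambda f: fact_scores[f], reverse=True
    )
    return ranked[:top_k]


def build_augmented_prompt(
    facts: List[str], retrieved: List[int], question: str
) -> str:
    """Prepend the retrieved facts to the question (hypothetical template)."""
    lines = ["Relevant facts:"]
    lines += [f"- {facts[f]}" for f in retrieved]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy example: 2 CoT tokens attending over 6 context tokens, 3 facts.
attn = [
    [0.10, 0.10, 0.60, 0.20, 0.00, 0.00],
    [0.00, 0.10, 0.50, 0.30, 0.05, 0.05],
]
fact_spans = [(0, 2), (2, 4), (4, 6)]
facts = ["Fact A", "Fact B", "Fact C"]

top = rank_facts_by_attention(attn, fact_spans, top_k=2)
prompt = build_augmented_prompt(facts, top, "Where is the object?")
```

In this toy setup the second fact receives most of the attention mass, so it is ranked first. In practice the attention weights would come from the model's forward pass over the long context, and the choice of layers and aggregation would need to be tuned.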

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: Both Retrieval-Reason (agentic framework) and Chain-of-Thought (CoT) might suffer from poor recall of implicit facts. Our proposed Attrieval leverages the internal attention weights to resolve this issue.
  • Figure 2: Analysis of CoT tokens, including: (a) recall with various retrieval methods; (b) accuracy with various prompts and questions. See \ref{sec:observation} for more details.
  • Figure 3: Proportion of attention from generated tokens to the input prompt across layers.
  • Figure 4: BABILONG results. Greener colors represent higher scores.
  • Figure 5: Ranking of tokens most attended in the statements. The example shows a failure case.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1: Long-Context Reasoning