ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, Yixuan Su

TL;DR

This paper addresses the decline in long-context reasoning performance of large language models and proposes ALR$^2$, a retrieve-then-reason framework that aligns LLMs with both retrieval and reasoning objectives. By adapting the Retrieval-Augmented Generation paradigm to long contexts and training the model to jointly retrieve coherent facts and reason over them, ALR$^2$ significantly improves QA performance on long-context benchmarks and reduces retrieval hallucinations. The approach demonstrates robust generalization to unseen datasets and maintains strong results across increasing context lengths, highlighting its practical impact for real-world, long-document QA tasks. The work offers a principled pathway to leverage intermediate retrieval steps to manage information over very long contexts, with potential applicability to multi-hop reasoning and explainable AI systems.

Abstract

The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length that the LLM can process has grown, the capability of the model to accurately reason over that context degrades noticeably. This occurs because modern LLMs often become overwhelmed by the vast amount of information in the context; when answering questions, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate the challenge of long-context reasoning, we develop a retrieve-then-reason framework, enabling LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to accurately retrieve relevant facts and instead often hallucinate "retrieved facts", resulting in flawed reasoning and the production of incorrect answers. To address these issues, we introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure, i.e., aligning LLMs with the objectives of both retrieval and reasoning. We demonstrate the efficacy of ALR$^2$ for mitigating performance degradation in long-context reasoning tasks. Through extensive experiments on long-context QA benchmarks, we find that our method outperforms competitive baselines by large margins, achieving at least 8.4 and 7.9 EM gains on the long-context versions of the HotpotQA and SQuAD datasets, respectively.
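The retrieve-then-reason flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not cover the alignment (training) step. It only shows the inference-time idea: first ask the model for supporting sentences, then check that each one appears verbatim in the context (a simple proxy for spotting hallucinated "retrieved facts"), then answer from the grounded facts only. The function names, the prompt wording, the two separate model calls, and the `generate` callable are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def split_retrieved(raw: str) -> List[str]:
    """Split the model's retrieval output into one candidate fact per line."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def grounded_facts(facts: List[str], context: str) -> Tuple[List[str], List[str]]:
    """Separate facts found verbatim in the context from those that are not.

    Whitespace is collapsed on both sides before the substring check.
    """
    normalized_context = " ".join(context.split())
    found: List[str] = []
    missing: List[str] = []
    for fact in facts:
        normalized_fact = " ".join(fact.split())
        if normalized_fact in normalized_context:
            found.append(fact)
        else:
            missing.append(fact)
    return found, missing


def retrieve_then_reason(
    context: str,
    query: str,
    generate: Callable[[str], str],  # hypothetical stand-in for any LLM call
) -> Tuple[str, List[str], List[str]]:
    """Run a retrieval step, filter ungrounded facts, then run a reasoning step."""
    # Stage 1: ask the model to quote the evidence (prompt wording is assumed).
    retrieval_prompt = (
        f"Context:\n{context}\n\nQuestion: {query}\n"
        "List the sentences from the context needed to answer, one per line."
    )
    facts = split_retrieved(generate(retrieval_prompt))
    found, missing = grounded_facts(facts, context)

    # Stage 2: reason only over facts that were verified against the context.
    reasoning_prompt = (
        "Facts:\n" + "\n".join(found) + f"\n\nQuestion: {query}\nAnswer:"
    )
    answer = generate(reasoning_prompt).strip()
    return answer, found, missing


def fake_generate(prompt: str) -> str:
    """Toy model used only to exercise the sketch; the second fact is invented."""
    if prompt.startswith("Context:"):
        return "Paris is the capital of France.\nFrance won the 2050 World Cup."
    return " Paris"


if __name__ == "__main__":
    ctx = "Berlin is in Germany. Paris is the capital of France. Rome is in Italy."
    answer, found, missing = retrieve_then_reason(
        ctx, "What is the capital of France?", fake_generate
    )
    print(answer, found, missing)
```

The verbatim check is deliberately strict: it flags any paraphrased sentence as ungrounded, which may be too aggressive in practice but keeps the hallucination signal easy to interpret.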

Paper Structure (33 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Performance of LLMs on three increasingly challenging long-context tasks. The x-axis represents the number of tokens in the long context, while the y-axis indicates the exact match score.
  • Figure 2: Error case of the prompting-based retrieve-then-reason approach. The Command-R model and the retrieve-then-reason prompt (Figure 3) are used; more details are in the main experimental setup section. Wavy-underlined text marks information matched between the golden facts and the retrieved facts. Underlined text marks information that is in the long context but not in the golden facts. Information hallucinated by the LLM is marked in red. Italic text in the supporting facts shows where the answer appears.
  • Figure 3: The retrieve-then-reason (RR) prompt for long-context QA. {CONTEXT} is the placeholder for the long context and {QUERY} is the placeholder for the user question. The red parts, {MODEL_RETRIEVED_SENTENCES} and {ANSWER}, are generated by the LLM.
  • Figure 4: The direct-answering (DA) prompt for long-context QA. {CONTEXT} is the placeholder for the long context, {QUERY} is the placeholder for the user question, and the red part {ANSWER} is the answer directly generated by the LLM.
  • Figure 5: The quotes-and-citation-first (QF) prompt for long-context QA. {CONTEXT} is the placeholder for the long context and {QUERY} is the placeholder for the user question. The format of {QUOTES_AND_ANSWER} is similar to the example enclosed in <example> and </example>.
  • ...and 1 more figure
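Figure 1 and the reported gains use the exact match (EM) score. As a rough illustration of how such a score is commonly computed, the sketch below follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). Whether the paper uses exactly this normalization is an assumption, and the function names are hypothetical.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """Return True if the prediction matches any gold answer after normalization."""
    normalized_prediction = normalize_answer(prediction)
    return any(normalized_prediction == normalize_answer(g) for g in gold_answers)


def em_score(predictions: List[str], references: List[List[str]]) -> float:
    """Compute the percentage of predictions that exactly match a reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

For example, `exact_match("The Eiffel Tower.", ["Eiffel Tower"])` is true under this normalization, since the article and the trailing period are removed before comparison.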