Reasoning on Multiple Needles In A Haystack

Yidong Wang

TL;DR

This work addresses two problems that undermine long-context reasoning benchmarks such as MNIAH-R: memory-based answering and accuracy degradation as context length grows. It first filters questions so that answers must rely on the provided supporting documents, then analyzes why accuracy declines as context grows, finding that shorter thinking chains, not needle placement, drive the drop. The authors propose a retrieval-reasoning decomposition with a reflection-based, iterative loop (the 4R/retrieval-reflection framework) and demonstrate its effectiveness by training an iterative-thinking model that substantially reduces the performance decline. They further apply this capability to mathematical reasoning, improving GPT-4o's AIME 2024 pass@1 with the retrieval-reflection approach. Overall, the paper offers a practical pathway to robust long-context reasoning through explicit thinking processes and iterative refinement, with demonstrated gains on both MNIAH-R and mathematical tasks.

Abstract

The Needle In A Haystack (NIAH) task has been widely used to evaluate the long-context question-answering capabilities of Large Language Models (LLMs). However, its reliance on simple retrieval limits its effectiveness. To address this limitation, recent studies have introduced the Multiple Needles In A Haystack Reasoning (MNIAH-R) task, which incorporates supporting documents (Multiple needles) of multi-hop reasoning tasks into a distracting context (Haystack). Despite this advancement, existing approaches still fail to address the issue of models providing direct answers from internal knowledge, and they do not explain or mitigate the decline in accuracy as context length increases. In this paper, we tackle the memory-based answering problem by filtering out direct-answer questions, and we reveal that performance degradation is primarily driven by the reduction in the length of the thinking process as the input length increases. Building on this insight, we decompose the thinking process into retrieval and reasoning stages and introduce a reflection mechanism for multi-round extension. We also train a model using the generated iterative thinking process, which helps mitigate the performance degradation. Furthermore, we demonstrate the application of this retrieval-reflection capability in mathematical reasoning scenarios, improving GPT-4o's performance on AIME 2024.
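
The retrieval, reasoning, and reflection stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `iterative_thinking`, the `max_rounds` parameter, and the three toy stand-ins (`toy_retrieve`, `toy_reason`, `toy_reflect`) are hypothetical names introduced here, and simple keyword matching takes the place of an LLM. It only shows how a reflection step might trigger another retrieval round when the evidence gathered so far is insufficient for a multi-hop question.

```python
import re
from typing import Callable, List, Tuple


def iterative_thinking(
    question: str,
    haystack: List[str],
    retrieve: Callable[[str, List[str], List[str]], List[str]],
    reason: Callable[[str, List[str]], str],
    reflect: Callable[[str, List[str], str], bool],
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Run retrieve -> reason -> reflect until reflection accepts the answer."""
    evidence: List[str] = []
    answer = ""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Retrieval stage: gather documents not already collected.
        for doc in retrieve(question, haystack, evidence):
            if doc not in evidence:
                evidence.append(doc)
        # Reasoning stage: attempt an answer from the evidence so far.
        answer = reason(question, evidence)
        # Reflection stage: stop if the answer is judged sufficient.
        if reflect(question, evidence, answer):
            break
    return answer, evidence, rounds


# --- Toy stand-ins (assumptions for illustration; a real system would call an LLM) ---

def toy_retrieve(question: str, haystack: List[str], evidence: List[str]) -> List[str]:
    # Match documents sharing a capitalized term with the question or prior evidence.
    terms = set(re.findall(r"[A-Z][a-z]+", " ".join([question] + evidence)))
    return [d for d in haystack if terms & set(re.findall(r"[A-Z][a-z]+", d))]


def toy_reason(question: str, evidence: List[str]) -> str:
    # Answer only once a "capital of X" statement has been retrieved.
    match = re.search(r"capital of (\w+)", " ".join(evidence))
    return match.group(1) if match else ""


def toy_reflect(question: str, evidence: List[str], answer: str) -> bool:
    # Accept any non-empty answer; otherwise request another round.
    return bool(answer)


haystack = [
    "Bananas are yellow.",
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
]
question = "Which country has the Eiffel Tower in its capital?"
answer, evidence, rounds = iterative_thinking(
    question, haystack, toy_retrieve, toy_reason, toy_reflect
)
print(answer, rounds)
```

In this toy run the first round retrieves only the Eiffel Tower needle, reflection rejects the empty answer, and the second round uses the newly seen term "Paris" to retrieve the second needle. The distracting document is never added to the evidence.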

Paper Structure

This paper contains 18 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Performance on MNIAH-R before filtering.
  • Figure 2: Performance on MNIAH-R after filtering.
  • Figure 3: Impact of Needles Placement Positions.
  • Figure 4: Impact of Distance Between Needles.
  • Figure 5: Impact of context length on thinking process length for a model with significant accuracy decline.
  • ...and 9 more figures