Reasoning on Multiple Needles In A Haystack
Yidong Wang
TL;DR
This work addresses the challenge that long-context reasoning benchmarks like MNIAH-R are undermined by memory-based answering and context-length degradation. It first filters questions so that answers must rely on the provided supporting documents, then analyzes why accuracy declines as context grows, finding that shorter thinking chains, rather than needle placement, drive the drop. The authors propose a retrieval-reasoning decomposition with a reflection-based iterative loop (the 4R/retrieval-reflection framework) and demonstrate its effectiveness by training an iterative-thinking model that substantially reduces the performance decline. They further apply this capability to mathematical reasoning, notably improving GPT-4o’s AIME 2024 pass@1 when using the retrieval-reflection approach. Overall, the paper offers a practical pathway to robust long-context reasoning through explicit thinking processes and iterative refinement, with demonstrated gains on both MNIAH-R and mathematical tasks.
Abstract
The Needle In A Haystack (NIAH) task has been widely used to evaluate the long-context question-answering capabilities of Large Language Models (LLMs). However, its reliance on simple retrieval limits its effectiveness. To address this limitation, recent studies have introduced the Multiple Needles In A Haystack Reasoning (MNIAH-R) task, which incorporates supporting documents (Multiple needles) of multi-hop reasoning tasks into a distracting context (Haystack). Despite this advancement, existing approaches still fail to address the issue of models providing direct answers from internal knowledge, and they do not explain or mitigate the decline in accuracy as context length increases. In this paper, we tackle the memory-based answering problem by filtering out direct-answer questions, and we reveal that performance degradation is primarily driven by the reduction in the length of the thinking process as the input length increases. Building on this insight, we decompose the thinking process into retrieval and reasoning stages and introduce a reflection mechanism for multi-round extension. We also train a model using the generated iterative thinking process, which helps mitigate the performance degradation. Furthermore, we demonstrate the application of this retrieval-reflection capability in mathematical reasoning scenarios, improving GPT-4o's performance on AIME 2024.
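The two steps described above, filtering out memory-answerable questions and extending the thinking process through retrieval, reasoning, and reflection rounds, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `LLM` callable, the prompt tags, the substring-match filter, the `SUFFICIENT` stop signal, and the `max_rounds` parameter are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def filter_memory_answerable(questions: List[Dict[str, str]], llm: LLM) -> List[Dict[str, str]]:
    """Keep only questions the model cannot answer without the context.

    Assumption: a question counts as memory-answerable if the gold answer
    appears (case-insensitively) in the closed-book response.
    """
    kept = []
    for q in questions:
        closed_book = llm(f"Answer without any context.\nQuestion: {q['question']}")
        if q["answer"].lower() not in closed_book.lower():
            kept.append(q)
    return kept


def iterative_answer(question: str, context: str, llm: LLM, max_rounds: int = 3) -> str:
    """Run retrieve -> reason -> reflect rounds until reflection is satisfied."""
    notes: List[str] = []
    answer = ""
    for _ in range(max_rounds):
        # Retrieval stage: pull the passages relevant to the question.
        notes.append(
            llm(f"[RETRIEVE]\nContext: {context}\nQuestion: {question}\nNotes: {notes}")
        )
        # Reasoning stage: answer using only the retrieved notes.
        answer = llm(f"[REASON]\nQuestion: {question}\nNotes: {notes}")
        # Reflection stage: decide whether another round is needed.
        verdict = llm(f"[REFLECT]\nQuestion: {question}\nNotes: {notes}\nAnswer: {answer}")
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
    return answer
```

In this sketch, each extra round lengthens the overall thinking process by appending newly retrieved notes, which mirrors the abstract's observation that a shorter thinking process accompanies lower accuracy at long input lengths. Any real use would replace the stubbed prompts and stopping rule with ones validated on the task.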
