NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models

Hyeonseok Moon, Heuiseok Lim

TL;DR

NeedleChain addresses the problem that current long-context benchmarks may overstate context understanding: by embedding substantial query-irrelevant content, they let models succeed by retrieving relevant snippets rather than integrating the full context. The authors define NeedleChain with Independent and Dependent needles arranged in Forward, Backward, and Mixed chains, plus a NeedleStack NIAH baseline, to rigorously test intact context integration; they also propose ROPE contraction to sharpen positional distinctions and improve full-context utilization. Experiments across multiple LLMs show pronounced weaknesses in reverse-direction reasoning and in retaining all of the context as the number of needles $k$ grows (up to $k=50$ in the main results), with calculation errors dominating at small $k$ and needle omissions rising at larger $k$. The findings suggest that simply increasing context length is insufficient for robust reasoning over context, and they point to forward-ordered presentation and simple techniques like ROPE contraction as practical avenues for enhancing intact-context understanding in LLMs.

Abstract

Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. We argue that, under this setting, current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the needle-in-a-haystack (NIAH) paradigm. By comparing these variants, NeedleChain enables a more comprehensive assessment of context understanding. We further propose ROPE contraction, a training-free strategy that encourages models to reflect all available information, highlighting the importance of full-context integration and pointing to new directions for improving reliable reasoning over context.
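
To make the three chain orderings concrete, the following is a minimal sketch of how a context made entirely of dependent needles might be assembled and presented in forward, backward, or mixed order. The needle wording ("coins"), the names `build_needles` and `arrange`, and the use of a random shuffle for the mixed order are all illustrative assumptions; they are not taken from the paper's actual data or code.

```python
import random


def build_needles(k, seed=0):
    """Build k dependent needles (hypothetical format).

    Needle i can only be resolved once needle i-1 is known, so
    answering a question about the last entity requires every needle.
    Returns the needles in reasoning order and the resolved values.
    """
    rng = random.Random(seed)
    names = [f"P{i}" for i in range(k)]
    values = [rng.randint(1, 9)]
    needles = [f"{names[0]} has {values[0]} coins."]
    for i in range(1, k):
        delta = rng.randint(1, 9)
        values.append(values[-1] + delta)
        needles.append(f"{names[i]} has {delta} more coins than {names[i - 1]}.")
    return needles, values


def arrange(needles, order, seed=0):
    """Return the needles in the order they are shown to the model."""
    if order == "forward":   # presented order matches reasoning order
        return list(needles)
    if order == "backward":  # presented order is the reverse of reasoning order
        return list(reversed(needles))
    if order == "mixed":     # assumption: a seeded random permutation
        shuffled = list(needles)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown order: {order}")


if __name__ == "__main__":
    needles, values = build_needles(k=5)
    context = " ".join(arrange(needles, "backward"))
    question = "How many coins does P4 have?"
    print(context)
    print(question, "Expected answer:", values[-1])
```

Because every sentence in such a context is needed to answer the question, a model cannot succeed by locating a single snippet, which is the property the benchmark is designed to probe.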

Paper Structure

This paper contains 24 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Performance comparison between NeedleChain (Backward chain) and its parallel NIAH-paradigm benchmark (NeedleStack). Reported token counts were estimated with the Qwen2.5 tokenizer.
  • Figure 2: Performance variation with respect to the domain composition of training data
  • Figure 3: Error analysis on NeedleChain. We analyze errors in each category to determine which of the three predefined error types they fall into.
  • Figure 4: Heatmaps showing the weaknesses at each position. The left panels show positional needle-missing heatmaps with respect to the "presented order"; the right panels show those with respect to the "reasoning order". We conducted experiments with k=200.
  • Figure 5: We compare the accuracy of models on two types of questions: those that require understanding only the tail of the reasoning chain ($q_{single}$) and those that require comprehensive understanding of the entire context ($q_{total}$).
  • ...and 2 more figures