NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models

Hyeonseok Moon; Heuiseok Lim

TL;DR

NeedleChain addresses the concern that current long-context benchmarks may overstate context understanding: because query-relevant information is embedded among irrelevant text, models can succeed partly through snippet retrieval rather than by integrating the whole context. The authors define NeedleChain with Independent and Dependent needles arranged in Forward, Backward, and Mixed chains, plus a NeedleStack NIAH baseline, to rigorously test intact context integration. They also propose ROPE contraction to sharpen positional distinctions and improve full-context utilization. Experiments across multiple LLMs show pronounced weaknesses in reverse-direction reasoning and in retaining all of the context as the number of needles $k$ grows (up to $k=50$ in the main results); calculation errors dominate at small $k$, while needle omissions rise at larger $k$. The findings suggest that simply increasing context length is insufficient for robust reasoning over context, and they point to forward-ordered presentation and simple techniques such as ROPE contraction as practical ways to improve intact-context understanding in LLMs.

Abstract

Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. We argue that, under this setting, current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the needle-in-a-haystack (NIAH) paradigm. By comparing these variants, NeedleChain enables a more comprehensive assessment of context understanding. We further propose ROPE contraction, a training-free strategy that encourages models to reflect all available information. Our results highlight the importance of full-context integration and point to new directions for improving reliable reasoning over context.
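
To make the idea of chains that differ in the required order of comprehension more concrete, the following is a minimal Python sketch of how Forward, Backward, and Mixed orderings of dependent needles could be generated. It is an illustration only, not the authors' construction: the function name `build_chain`, the coin-counting sentence templates, and the use of a random shuffle for the Mixed order are all assumptions made for this example.

```python
import random


def build_chain(k, order="forward", seed=0):
    """Build a toy chain of k dependent needles (hypothetical format).

    Every needle after the first refers to the previous person, so the
    final answer can only be computed by using all k needles.
    Returns (needles, question, answer).
    """
    rng = random.Random(seed)
    names = [f"P{i}" for i in range(k)]

    # The first needle states a value directly; the rest depend on it.
    values = [rng.randint(1, 9)]
    needles = [f"{names[0]} has {values[0]} coins."]
    for i in range(1, k):
        delta = rng.randint(1, 9)
        values.append(values[-1] + delta)
        needles.append(
            f"{names[i]} has {delta} more coins than {names[i - 1]}."
        )

    # Only the presentation order changes; the facts stay the same.
    if order == "backward":
        needles = needles[::-1]
    elif order == "mixed":
        rng.shuffle(needles)  # assumption: Mixed is a random permutation
    elif order != "forward":
        raise ValueError(f"unknown order: {order}")

    question = f"How many coins does {names[-1]} have?"
    return needles, question, values[-1]


if __name__ == "__main__":
    for order in ("forward", "backward", "mixed"):
        needles, question, answer = build_chain(4, order, seed=1)
        print(order, needles, question, answer)
```

Because the values are drawn before any reordering, the same seed yields the same facts and the same answer in all three orders. Any difference in model accuracy across orders would then reflect presentation order alone, which is the kind of comparison the variants are meant to support.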
