Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng

TL;DR

The paper shows that extending the context alone can degrade LLM reasoning even when retrieval is perfect, challenging the idea that long-input failures can be fixed by better retrieval alone. It introduces a controlled long-context benchmark across math, QA, and coding, and demonstrates a consistent length-driven drop in accuracy across open- and closed-source models. A simple retrieve-then-solve strategy, in which the model first recites the retrieved evidence and then solves the problem from that shortened context, improves performance and offers a practical mitigation. The work highlights the need for holistic evaluation of long-context capabilities and points to input length itself as a bottleneck that retrieval improvements alone cannot overcome.

Abstract

Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in the long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, even though the input remains well within the models' claimed lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement for GPT-4o of up to 4% on an already strong baseline.
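The mitigation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-call structure, the prompt wording, and the names `build_recite_prompt`, `build_solve_prompt`, `retrieve_then_solve`, and the `generate` callable (standing in for any LLM completion function) are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical instruction text; the paper's actual prompt is not given in this summary.
RECITE_INSTRUCTION = (
    "Copy, word for word, every sentence from the context that is needed "
    "to answer the question. Do not solve the problem yet."
)


def build_recite_prompt(long_context: str, question: str) -> str:
    """Step 1 prompt: ask the model to recite the relevant evidence only."""
    return f"{long_context}\n\nQuestion: {question}\n\n{RECITE_INSTRUCTION}"


def build_solve_prompt(recited_evidence: str, question: str) -> str:
    """Step 2 prompt: a short context holding only the recited evidence."""
    return f"{recited_evidence.strip()}\n\nQuestion: {question}\n\nAnswer:"


def retrieve_then_solve(
    long_context: str,
    question: str,
    generate: Callable[[str], str],  # assumed: maps a prompt to model output
) -> str:
    """Turn a long-context task into a short-context one, then solve it."""
    evidence = generate(build_recite_prompt(long_context, question))
    return generate(build_solve_prompt(evidence, question))
```

The design point is that the second call never sees the long input, so its context length depends only on the recited evidence and the question.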

Paper Structure

This paper contains 27 sections, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Extending the input length alone substantially degrades LLM reasoning capability, even if the model is still able to retrieve the relevant evidence. In this example, inserting 25000 whitespace characters (which cause minimal distraction) does not prevent the model from correctly extracting all conditions and the question, but nevertheless causes it to reach the wrong answer.
  • Figure 2: Left: In our synthetic benchmark, each long-context problem is created by separating a short-context problem into evidence and question, and extending the length with distraction tokens. Right: We discuss three types of distractions in this work, ordered by decreasing strength: Essay tokens (\ref{sec:measuring}), Whitespace (\ref{subsec:space}), and masking out all distraction tokens (\ref{subsec:masking}).
  • Figure 3: Evaluation results on Llama3-8B and Mistral-v0.3-7B, showing problem-solving accuracy (Accuracy) and retrieval scores measured by Exact Match (Retrieval). "Context length" refers to the total number of input tokens for each problem, which is constructed by inserting PaulGrahamEssay tokens between evidence and question (as illustrated in \ref{fig:filler_experiment}). See Appendix for detailed numbers.
  • Figure 4: Performance across different context lengths on Llama-3-8B Instruct and Mistral-v0.3-7B-Instruct, with the corresponding numbers of whitespace tokens inserted for minimum distraction. (a, Left) Whitespace is inserted between evidence and question. (b, Right) Whitespace is inserted before the evidence, with the question adjacent to the evidence.
  • Figure 5: Our strategy retrieves evidence to shorten the context length before solving the task.
  • ...and 8 more figures
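The benchmark construction described in the captions for Figures 2 and 4 can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function name `make_long_problem`, its parameters, and the newline separators are assumptions, and the filler is counted in characters here, whereas the paper reports context length in tokens.

```python
def make_long_problem(
    evidence: str,
    question: str,
    n_filler: int,
    filler: str = " ",           # assumed: one whitespace character per filler unit
    placement: str = "between",  # "between" (Fig. 4a) or "before" (Fig. 4b)
) -> str:
    """Extend a short problem with filler while keeping evidence and question intact."""
    pad = filler * n_filler
    if placement == "between":
        # Filler separates the evidence from the question.
        return f"{evidence}\n{pad}\n{question}"
    if placement == "before":
        # Filler comes first; the question stays adjacent to the evidence.
        return f"{pad}\n{evidence}\n{question}"
    raise ValueError(f"unknown placement: {placement!r}")


# Example: a toy problem padded with 25000 whitespace characters.
long_problem = make_long_problem(
    "Tom has 3 apples and buys 2 more.",
    "How many apples does Tom have?",
    n_filler=25000,
)
```

Replacing `filler` with essay text would give the stronger distraction setting mentioned in Figure 2. The masking variant changes the attention mask rather than the input string, so it is not covered by this sketch.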