Contextual Drag: How Errors in the Context Affect LLM Reasoning

Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora

TL;DR

Contextual drag is a persistent failure mode in large language model (LLM) reasoning: erroneous drafts in the context bias subsequent generations toward structurally similar errors, causing 10–20% performance drops across models and tasks and, in iterative refinement, even self-deterioration. The study evaluates 11 models on 8 reasoning benchmarks, analyzes reasoning structure with tree edit distance, and shows that neither external error signals nor successful self-verification fully counteracts the bias. Mitigations such as test-time context denoising and targeted supervised fine-tuning yield partial improvements but fail to restore clean-slate performance, highlighting fundamental limitations of current reasoning architectures for self-improvement pipelines. The findings underscore the need for principled mechanisms that reset or discount unreliable context, so that multi-step reasoning becomes more reliable and agent-like behavior in AI systems becomes safer.

Abstract

Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
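
As a rough illustration of how the drop described above might be quantified, the sketch below compares clean-slate accuracy with error-conditioned accuracy on the same set of problems. The function names and the correctness lists are hypothetical, and treating the drop as an absolute difference in percentage points is an assumption; the paper's exact evaluation protocol is not reproduced here.

```python
from statistics import mean


def accuracy(outcomes):
    """Fraction of correct answers in a list of booleans."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def contextual_drag(clean_slate, error_conditioned):
    """Accuracy drop, in percentage points, from clean-slate generation
    to generation conditioned on an incorrect draft (assumed definition)."""
    return 100.0 * (accuracy(clean_slate) - accuracy(error_conditioned))


# Hypothetical per-problem outcomes, for illustration only.
clean = [True, True, True, False, True, True, True, False, True, True]
conditioned = [True, False, True, False, True, True, False, False, True, True]

drag = contextual_drag(clean, conditioned)
print(f"clean-slate accuracy:       {accuracy(clean):.2f}")
print(f"error-conditioned accuracy: {accuracy(conditioned):.2f}")
print(f"contextual drag:            {drag:.1f} points")
```

A positive value means the incorrect draft hurt performance; a value near zero would indicate that the model is robust to the erroneous context on this set.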

Paper Structure

This paper contains 46 sections, 1 equation, 10 figures, and 15 tables.

Figures (10)

  • Figure 1: Contextual drag is characterized by the performance drop from clean-slate generation, where the model generates with no additional context, to error-conditioned generation, where the model generates conditioned on incorrect draft solutions in the context. Contextual drag not only manifests as performance degradation but also as a bias in reasoning patterns toward erroneous context.
  • Figure 2: Self-deterioration: Iterative refinement of models with severe degradation from contextual drag, such as GPT-OSS-20B, collapses in accuracy across iterations, whereas majority voting improves steadily. We sample 16 iterative refinement trajectories and report the average performance across all trajectories.
  • Figure 3: Contextual drag arises from copying reasoning patterns: Measured by mean tree edit distance (TED), models' subsequent solutions under contextual drag (1F) stay significantly closer to the in-context erroneous reasoning than to clean-slate solutions (Direct). Lower TED indicates stronger structural similarity to the in-context draft but not necessarily worse performance.
  • Figure 4: External error signal does not recover subsequent reasoning from contextual drag: Models still experience significant drops from Direct to 1F across benchmarks with the exception of MMLU, which is usually considered more knowledge-intensive than reasoning-intensive. Full results are provided in Table \ref{tab:appd-external-error}.
  • Figure 5: Correct self-verification gives rise to varying robustness to contextual drag: Conditioning on self-detected error signals, contextual drag still persists with varied performance changes: Nemotron-7B/32B recover toward (and even surpass) Direct performance, while GPT-OSS-20B remains strongly degraded. Full results are provided in Table \ref{tab:appd-self-detected-error}.
  • ...and 5 more figures
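
The tree edit distance (TED) used in the Figure 3 caption can be illustrated with a minimal sketch. The code below implements the textbook unit-cost edit distance between ordered labeled trees (insert, delete, relabel) with memoized recursion. How the paper converts reasoning traces into trees is not described here, so the tree shapes and step labels (`setup`, `expand`, and so on) are hypothetical, and this simple recursion is not claimed to be the authors' implementation.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable nested tuple."""
    return (label, tuple(children))


def size(forest):
    """Total number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)  # insert every remaining node of f2
    if not f2:
        return size(f1)  # delete every remaining node of f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    rest1, rest2 = f1[:-1], f2[:-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(rest1 + c1, f2) + 1
    # Insert the rightmost root of f2; its children take its place.
    insert = forest_distance(f1, rest2 + c2) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    match = forest_distance(c1, c2) + forest_distance(rest1, rest2) + (l1 != l2)
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Hypothetical reasoning trees, for illustration only.
draft = tree("solve", tree("setup"), tree("expand", tree("drop_term")), tree("answer"))
conditioned = tree("solve", tree("setup"), tree("expand", tree("drop_term")),
                   tree("check"), tree("answer"))
clean = tree("solve", tree("setup"), tree("substitute"), tree("answer"))

print(tree_edit_distance(draft, conditioned))  # distance to the error-conditioned solution
print(tree_edit_distance(draft, clean))        # distance to the clean-slate solution
```

In this toy example the error-conditioned solution differs from the draft by one inserted step, while the clean-slate solution requires two edits, mirroring the qualitative pattern the caption describes: a lower TED means the solution stays structurally closer to the in-context draft.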