Revisiting Long-context Modeling from Context Denoising Perspective

Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang

TL;DR

This work investigates why long-context models struggle with noise in extended inputs and proposes a two-part solution: a fine-grained critical-token detector based on Integrated Gradients (IG) and a Context Denoising Training (CDT) procedure. CDT denoises the input by down-weighting irrelevant tokens via a gradient-informed mechanism and trains the model to strengthen the connection between salient tokens and predictions, effectively implementing an online EM-like process. Across four task families and multiple model families, CDT yields consistent gains, outperforming strong baselines and, in some cases, bringing an open-source model close to GPT-4o (50.92 vs. 51.00 for an 8B model). The results demonstrate that reducing contextual noise can substantially improve attention to critical information, enabling more reliable long-context understanding with practical training efficiency gains.

Abstract

Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
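To make the IG idea concrete, the following is a minimal sketch of generic Integrated Gradients on a toy scalar function. It is not the paper's attention-level IG score. The function `toy_score`, the zero baseline, the step count, and the finite-difference gradients are all illustrative assumptions. The sketch only shows how an input that does not affect the output (a stand-in for a "noise" token) receives zero attribution, while influential inputs receive larger scores.

```python
import math


def integrated_gradients(f, x, baseline, steps=64, eps=1e-5):
    """Approximate IG attributions of a scalar function f at x.

    Gradients are averaged along the straight path from baseline to x
    (midpoint Riemann sum) and scaled by (x - baseline).
    Central finite differences stand in for autograd here (assumption).
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = point[:]
            down = point[:]
            up[i] += eps
            down[i] -= eps
            avg_grad[i] += (f(up) - f(down)) / (2 * eps) / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_score(v):
    # Hypothetical "model": uses inputs 0 and 1, ignores input 2 (noise).
    return math.tanh(2.0 * v[0] + 0.5 * v[1])


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_score, x, baseline)
print(attributions)
```

In this toy setting, the attributions sum (approximately) to `toy_score(x) - toy_score(baseline)`, which is the completeness property of IG. Thresholding such scores is one simple way a detector could separate critical inputs from noise. How the paper actually computes and uses its scores is described in its method section, not here.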

Paper Structure

This paper contains 56 sections, 11 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: Comparative overview of model performance on real-world long-context tasks and performance gain per billion tokens among different training methods. The bubble size indicates the relative training data volume.
  • Figure 2: Task format of our preliminary study, which requires models to predict the final answer by reasoning through multi-hop Supporting Facts while distinguishing them from the Interference Facts. The model should also resist the influence of Irrelevant Documents and Low-Frequency Words. More details are given in the appendix.
  • Figure 3: Comparison between attention distribution and information flow on the critical token location task. A significant difference in the distributions of critical and irrelevant contexts is revealed.
  • Figure 4: Attention distributions before and after manual context denoising. After denoising, attention scores on critical tokens increase by $10\times$, while scores on irrelevant tokens decrease.
  • Figure 5: Relationship between the attention IG score and L2-normalized embedding gradients for different types of tokens. The two quantities are proportionally correlated.
  • ...and 9 more figures