How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

Yan Meng, Di Wu, Christof Monz

TL;DR

This work tackles semantic misalignment as the main noise in web-mined parallel data for machine translation. It introduces a misalignment simulator controlled by semantic similarity and a self-correction training method that gradually shifts trust from ground-truth targets to the model’s own predictions using a dynamic schedule and sharpening of predicted distributions. Empirical results show that self-correction consistently outperforms traditional pre-filters and truncation baselines across simulated and real noisy datasets, with notable gains in low-resource settings and on real web-mined corpora. The approach preserves clean data performance while improving misaligned data translations, highlighting the practical value of leveraging the model’s own predictions to revise supervision during training.

Abstract

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, which closely resembles misaligned sentences in real-world web-crawled corpora. Under our simulated misalignment noise settings, we quantitatively analyze its impact on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the necessity of more fine-grained ways to handle hard-to-detect misalignment noise. With an observation of the increasing reliability of the model's self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction, an approach that gradually increases trust in the model's self-knowledge to correct the training supervision. Comprehensive experiments show that our method significantly improves translation performance both in the presence of simulated misalignment noise and when applied to real-world, noisy web-mined datasets, across a range of translation tasks.
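The self-correction idea described above (shifting trust from the reference target toward the model's own sharpened prediction as training progresses) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear ramp in `trust_weight`, the `max_trust` cap, the power-based `sharpen` function, and all function names are assumptions chosen for illustration. Only the general recipe (a dynamic schedule plus sharpening with a temperature $\tau$) comes from the text.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sharpen(probs, tau):
    """Sharpen a distribution by raising it to 1/tau and renormalizing.
    tau < 1 makes it peakier. (Assumed form of sharpening.)"""
    powered = probs ** (1.0 / tau)
    return powered / powered.sum(axis=-1, keepdims=True)

def trust_weight(step, total_steps, max_trust=0.5):
    """Hypothetical schedule: trust in the model's prediction grows
    linearly from 0 to max_trust over training."""
    return max_trust * min(1.0, step / total_steps)

def self_corrected_targets(logits, target_ids, step, total_steps, tau=0.5):
    """Mix the one-hot reference with the model's sharpened prediction."""
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[-1])[target_ids]
    lam = trust_weight(step, total_steps)
    return (1.0 - lam) * one_hot + lam * sharpen(probs, tau)

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against a soft target distribution, averaged over tokens."""
    log_probs = np.log(softmax(logits) + 1e-12)
    return float(-(soft_targets * log_probs).sum(axis=-1).mean())

# Two target tokens over a toy vocabulary of size 3.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
target_ids = np.array([0, 2])  # the second reference token disagrees with the model
early = self_corrected_targets(logits, target_ids, step=0, total_steps=100)
late = self_corrected_targets(logits, target_ids, step=100, total_steps=100)
print(early)  # identical to the one-hot references
print(late)   # probability mass has moved toward the model's prediction
```

Early in training the revised target equals the reference, so the model learns as usual. Later, a token on which the model confidently disagrees with the reference (a likely misalignment) receives a softened target, which reduces the penalty for not reproducing the noisy token.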

Paper Structure (46 sections, 5 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: The accuracy of various data filters in distinguishing misaligned noise from clean parallel data. All four data filters perform similarly to random guessing (indicated by the black dashed line) on Misaligned-LASER/COMET.
  • Figure 2: Loss (top) and el2n (bottom) distributions for clean and Misaligned-LASER noise samples during training (epochs 5 and 30). The red distribution represents Misaligned-LASER noise and the blue distribution represents clean data. As training progresses, the el2n distributions for clean and noisy data shift differently. The distribution plots for the full training process are in the Appendix (Figure 4).
  • Figure 3: Performance differences between our self-correction method and the baseline on noisy (Misaligned-LASER) and clean data for the De$\rightarrow$En task with 30% injected Misaligned-LASER noise. The gains of our method come mainly from better translation of the misaligned noisy data rather than of the clean data.
  • Figure 4: Loss (top) and el2n (bottom) distributions for clean and Misaligned-LASER noise samples during training (epochs 5, 10, 15, and 30). The red distribution represents Misaligned-LASER noise and the blue distribution represents clean data.
  • Figure 5: BLEU scores of the self-correction models on the De$\rightarrow$En task with 30% injected noise of different types, for varying $\tau$.
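
The el2n score plotted in Figures 2 and 4 can be sketched as follows, assuming the commonly used definition: the L2 norm of the difference between the model's predicted distribution and the one-hot reference, computed here per target token. The function name `token_el2n` and the toy inputs are illustrative, and the paper's exact aggregation over tokens or sentences may differ.

```python
import numpy as np

def token_el2n(logits, target_ids):
    """Per-token el2n: L2 norm of (softmax(logits) - one_hot(target))."""
    z = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    one_hot = np.eye(logits.shape[-1])[target_ids]
    return np.linalg.norm(probs - one_hot, axis=-1)

# The model is confident in token 0 at both positions.
logits = np.array([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
# The first reference agrees with the model; the second does not.
target_ids = np.array([0, 1])
print(token_el2n(logits, target_ids))
```

A token whose reference matches a confident prediction scores close to 0, while a confident disagreement approaches the maximum of $\sqrt{2}$. This separation is what makes the score usable as a signal for telling misaligned data apart from clean data once the model's predictions become reliable.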