Did Translation Models Get More Robust Without Anyone Even Noticing?

Ben Peters, André F. T. Martins

TL;DR

This study challenges the conventional wisdom that MT robustness to noise hinges on specialized training, showing that modern large multilingual models and LLMs exhibit markedly greater tolerance to synthetic and social-media noise even without robustness-focused techniques. Through controlled synthetic perturbations, in-domain social-media tests, and reference-free metrics, the authors reveal that robustness is not solely a function of model size or architecture but is driven by data exposure and training paradigms. They introduce COMET-slope and DeltaQE as robust measures of degradation under noise and demonstrate that finetuning on noisy data or applying a source-correction pipeline can further boost robustness, sometimes surpassing open LLMs on synthetic tasks. The work suggests practical strategies to improve smaller models and highlights the remaining gaps in low-resource languages and broader noise domains, offering concrete avenues for routing decisions and future research.

Abstract

Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to "noisy" inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and large language models (LLMs) applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments -- LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.

Paper Structure (50 sections, 4 figures, 21 tables)

This paper contains 50 sections, 4 figures, 21 tables.

Figures (4)

  • Figure 1: COMET-22 on the FLORES English-French devtest set when some proportion of source tokens are noised by swapping an adjacent pair of characters.
  • Figure 2: COMET on en$\rightarrow$fr swaps.
  • Figure 3: OPUS en$\rightarrow$pt swaps with finetuning and source correction (SC).
  • Figure 4: Percentage of en$\rightarrow$pt swap examples for which finetuning OPUS (top), correcting OPUS (middle), or correcting TI (bottom) outperforms the baseline.
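The swap noise named in the Figure 1 caption (some proportion of source tokens noised by swapping an adjacent pair of characters) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenization, the uniform choice of which tokens and which character pair to perturb, and the names `swap_adjacent`, `noise_sentence`, and `seed` are all assumptions made for illustration.

```python
import random


def swap_adjacent(token, rng):
    """Swap one randomly chosen adjacent pair of characters in `token`.

    Tokens shorter than two characters are returned unchanged. If the chosen
    pair consists of two identical characters, the token also looks unchanged.
    """
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)  # index of the left character of the pair
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def noise_sentence(sentence, proportion, seed=0):
    """Apply swap noise to roughly `proportion` of the whitespace tokens.

    Assumption: tokens are whitespace-separated and the tokens to perturb are
    sampled uniformly without replacement among those with >= 2 characters.
    Output tokens are re-joined with single spaces.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    tokens = sentence.split()
    eligible = [i for i, tok in enumerate(tokens) if len(tok) >= 2]
    n_noised = min(len(eligible), round(proportion * len(tokens)))
    for i in rng.sample(eligible, n_noised):
        tokens[i] = swap_adjacent(tokens[i], rng)
    return " ".join(tokens)


if __name__ == "__main__":
    source = "the quick brown fox jumps"
    for p in (0.0, 0.4, 1.0):
        print(p, noise_sentence(source, p, seed=1))
```

Sweeping `proportion` from 0 to 1 and scoring each system's translations at every level would produce the kind of degradation curve that Figure 1 describes; the scoring step is left out here because it depends on the evaluation setup.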