Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Víctor M. Sánchez-Cartagena; Miquel Esplà-Gomis; Juan Antonio Pérez-Ortiz; Felipe Sánchez-Martínez

TL;DR

This work introduces MaTiLDA, a multi-task data augmentation strategy for neural machine translation that uses deliberately non-fluent synthetic target data to strengthen encoder representations. By applying multiple transformations to target sentences and training with a per-sample task token, MaTiLDA averages over several synthetic variants in a single loss, improving translation quality across ten low-resource and four high-resource tasks and enhancing domain robustness while reducing hallucinations. It also demonstrates that MaTiLDA complements back-translation, yielding further gains when used together, and provides explainability analyses showing increased reliance on source information. The method is simple to implement, architecture-agnostic, and scalable to existing NMT pipelines, with strong implications for boosting MT performance in low-resource scenarios and under domain shift.

Abstract

When the parallel sentences available to train a neural machine translation system are scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the available parallel data. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, and the resulting systems are much more robust against domain shift and produce fewer hallucinations.
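The core idea, treating non-fluent synthetic targets as if they were sentences in another language, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two transformations (reversing and shuffling target tokens), the tag strings `<rev>` and `<shuf>`, the choice to prepend the tag to the source sentence, and the function name `augment`.

```python
import random


def reverse_tokens(tokens, rng):
    """Hypothetical transformation: target tokens in reverse order."""
    return tokens[::-1]


def shuffle_tokens(tokens, rng):
    """Hypothetical transformation: target tokens in random order."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled


# Assumed task tokens; each one plays the role of a target-language tag
# in a multilingual NMT setup.
TRANSFORMS = {"<rev>": reverse_tokens, "<shuf>": shuffle_tokens}


def augment(pairs, transforms=TRANSFORMS, seed=0):
    """Return the original pairs plus one tagged synthetic pair per transformation."""
    rng = random.Random(seed)
    augmented = []
    for src, tgt in pairs:
        # Original sample, left untagged (an assumption of this sketch).
        augmented.append((src, tgt))
        src_tokens, tgt_tokens = src.split(), tgt.split()
        for tag, transform in transforms.items():
            # The task token tells the model which "language" to produce.
            tagged_src = " ".join([tag] + src_tokens)
            synthetic_tgt = " ".join(transform(tgt_tokens, rng))
            augmented.append((tagged_src, synthetic_tgt))
    return augmented


corpus = [("das ist ein Test", "this is a test")]
augmented = augment(corpus)
for src, tgt in augmented:
    print(src, "|||", tgt)
```

The resulting list could be fed to any standard NMT training pipeline as a single mixed corpus; at inference time only untagged source sentences would be translated.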

Paper Structure (16 sections, 8 equations, 3 figures, 12 tables)

Introduction
Neural machine translation
Data augmentation strategies
Experimental settings
Datasets
Training
Results and discussion
Low-resource conditions
Combination with back-translation
High-resource conditions
Domain robustness
Explainability
Relative source and target contributions
Hallucinations
Related work
...and 1 more section

Figures (3)

Figure 1: Source influence throughout relative target sentence positions for the English–German low-resource in-domain test set.
Figure 2: Kernel density estimations (bandwidth=$0.06$) for LaBSE-based cosine similarities between the output produced by NMT models trained in low-resource conditions and the reference translations in test sets belonging to different domains. DA methods: baseline, SwitchOut+RAML, SeqMix, MaTiLDA.
Figure 3: Kernel density estimations (bandwidth=$0.06$) for LaBSE-based cosine similarities between the output produced by NMT models trained in high-resource conditions and the reference translations in test sets belonging to different domains. DA methods: baseline, SwitchOut+RAML, SeqMix, MaTiLDA.
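
The density curves described in Figures 2 and 3 can be approximated with a short sketch: compute a cosine similarity per output/reference pair, then smooth the scores with a kernel density estimate. The sketch makes several assumptions that the captions do not confirm: a Gaussian kernel, the bandwidth 0.06 interpreted as the kernel standard deviation, and made-up similarity scores in place of real LaBSE embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gaussian_kde(samples, bandwidth=0.06):
    """Return a density function: the average of Gaussians centred on each sample."""
    coef = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return coef * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up similarity scores standing in for LaBSE-based cosine similarities
# between system outputs and reference translations.
scores = [0.31, 0.78, 0.82, 0.85, 0.88, 0.90, 0.91, 0.93]
density = gaussian_kde(scores, bandwidth=0.06)

# Evaluate the curve on a coarse grid, as a plot would.
for x in (0.3, 0.6, 0.9):
    print(f"density({x}) = {density(x):.3f}")
```

In a plot of this kind, mass concentrated near low similarity values would correspond to outputs that are semantically unrelated to the reference.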
