Investigating Backtranslation in Neural Machine Translation

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, Peyman Passban

TL;DR

This study systematically evaluates back-translation as a training data strategy for neural machine translation (NMT) by comparing authentic (human-translated), synthetic (back-translated), and hybrid datasets built from WMT 2015 German→English data. Using a consistent OpenNMT-py setup, the authors show that systems trained on authentic data generally improve as more data is added, that synthetic data alone can come close to authentic data on some metrics, and that hybrid data often yields the strongest gains when the amount of synthetic data is balanced properly. A key finding is the existence of a tipping point: too much synthetic data relative to authentic data can degrade performance, although the exact ratio depends on data size and on the metric used. These results have practical implications for resource-poor scenarios, where back-translation can help bootstrap MT systems. The work offers practical guidance on using back-translation and highlights that careful balancing of data sources is crucial for NMT performance.

Abstract

A prerequisite for training corpus-based machine translation (MT) systems -- either Statistical MT (SMT) or Neural MT (NMT) -- is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, backtranslation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus -- both as a separate standalone dataset as well as combined with human-generated parallel data -- affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.
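
The incremental setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `step` parameter, and the toy sentence pairs are all hypothetical, and the sketch only assumes that a hybrid set keeps every authentic pair and adds back-translated pairs in growing slices.

```python
from typing import List, Tuple

# (German source, English target). In a synthetic pair the German side
# would come from back-translating authentic English text.
SentencePair = Tuple[str, str]


def incremental_hybrid_sets(
    authentic: List[SentencePair],
    synthetic: List[SentencePair],
    step: int,
) -> List[List[SentencePair]]:
    """Build training sets that keep all authentic pairs and add
    synthetic pairs in increments of `step` (hypothetical helper)."""
    if step <= 0:
        raise ValueError("step must be positive")
    sets = []
    # Start with no synthetic data, then grow the synthetic slice.
    for n_synth in range(0, len(synthetic) + 1, step):
        sets.append(authentic + synthetic[:n_synth])
    return sets


def synthetic_ratio(n_authentic: int, n_synthetic: int) -> float:
    """Fraction of the training set that is synthetic."""
    total = n_authentic + n_synthetic
    return n_synthetic / total if total else 0.0


# Toy data, for illustration only.
authentic = [("Guten Morgen", "Good morning"), ("Danke", "Thank you")]
synthetic = [
    ("Wie geht es dir", "How are you"),
    ("Bis bald", "See you soon"),
    ("Gute Nacht", "Good night"),
    ("Ich bin hier", "I am here"),
]

hybrid_sets = incremental_hybrid_sets(authentic, synthetic, step=2)
sizes = [len(s) for s in hybrid_sets]
ratios = [synthetic_ratio(len(authentic), n - len(authentic)) for n in sizes]
```

Each entry in `hybrid_sets` would be used to train one system, so translation quality can be tracked against the synthetic share in `ratios`. That share is the quantity the tipping-point finding refers to.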

Paper Structure

This paper contains 10 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Quality scores of NMT systems trained with different sizes of training data from the auth$_{0+}$ and hybr sets.
  • Figure 2: Quality scores of NMT systems trained with different sizes of training data from the auth$_{1+}$ and synth sets.