When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale

Christos Baziotis, Biao Zhang, Alexandra Birch, Barry Haddow

TL;DR

The paper investigates when monolingual data helps multilingual machine translation (MMT) by systematically varying domain alignment and model scale. It compares backtranslation (BT) with two denoising autoencoding (DAE) objectives, MASS and BART, across 100 translation directions and multiple test domains, with monolingual data drawn either from a single domain (Wiki) or from a mix of domains. At small scales, models are brittle to domain mismatch: BT outperforms DAE in many settings but suffers when the domains are misaligned. As model capacity grows, MASS becomes increasingly competitive, especially for low-resource directions. The practical guidance is to diversify monolingual data, prefer BT when the parallel, monolingual, and test domains match, and scale models up to unlock the benefits of DAE; MASS typically outperforms BART and offers a cheaper alternative to BT in large-scale MMT.

Abstract

Multilingual machine translation (MMT), trained on a mixture of parallel and monolingual data, is key for improving translation in low-resource language pairs. However, the literature offers conflicting results on the performance of different methods of including monolingual data. To resolve this, we examine how denoising autoencoding (DAE) and backtranslation (BT) impact MMT under different data conditions and model scales. Unlike prior studies, we use a realistic dataset of 100 translation directions and consider many domain combinations of monolingual and test data. We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales. BT is beneficial when the parallel, monolingual, and test data sources are similar but can be detrimental otherwise, while DAE is less effective than previously reported. Next, we analyze the impact of scale (from 90M to 1.6B parameters) and find it is important for both methods, particularly DAE. As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M to converging with BT performance at 1.6B, and even surpassing it in low-resource. These results offer new insights into how to best use monolingual data in MMT.

Paper Structure (33 sections, 16 figures, 41 tables)


Figures (16)

  • Figure 1: Illustration of the MASS objective.
  • Figure 2: Illustration of the BART objective.
  • Figure 3: BLEU differences between each model and the parallel-only model (red dotted line) on the ML50 test data.
  • Figure 4: BLEU differences between each model and the baseline (red dotted line) on FLORES and NTREX.
  • Figure 5: Data sources used for the ML50 test sets.
  • ...and 11 more figures