Extending Multilingual Machine Translation through Imitation Learning

Wen Lai, Viktor Hangya, Yingli Shen, Alexander Fraser

TL;DR

Problem: extend MNMT to new languages without access to original training data and without forgetting existing languages. Approach: formulate extension as imitation learning, using an expert to generate pseudo-parallel data and a learner to imitate data distribution and translation behavior with language-weighted objectives. Findings: Imit-MNMT improves translations between the new language and all existing languages while preserving original performance, reduces copy and off-target errors, and shows script-based transfer. Significance: enables scalable, data-efficient expansion of MNMT systems to many languages and can generalize to other models and NLP tasks.

Abstract

Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world's languages are still being left behind. We aim to extend large-scale MNMT models to incorporate a new language, enabling translation between this new language and all previously supported languages, even in the challenging scenario where only a parallel corpus between the new language and English is available. Previous methods, such as continued training on parallel data that includes the new language, often suffer from catastrophic forgetting, which degrades performance on the other languages. We propose a novel approach, Imit-MNMT, which treats this task as an imitation learning problem, a technique widely used in computer vision but less explored in natural language processing. Specifically, we leverage an expert model to generate pseudo-parallel corpora between the new language and the existing languages. We then introduce a data distribution imitation strategy using language-specific weighting, alongside a translation behavior imitation mechanism. Extensive experiments show that our approach significantly improves translation performance between the new and existing languages while mitigating catastrophic forgetting.
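
The pseudo-parallel data step described in the abstract can be illustrated with a minimal sketch. It assumes the expert pivots through English: each English sentence from the (new language, English) corpus is translated into an existing language $k$ and paired with the original new-language sentence. The names `build_pseudo_parallel`, `expert_translate`, and `toy_expert` are hypothetical and do not come from the paper; the toy expert merely tags text with the target language so the example runs without a real MNMT model.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_pseudo_parallel(
    new_eng_pairs: List[Pair],
    expert_translate: Callable[[str, str, str], str],
    existing_langs: List[str],
) -> Dict[str, List[Pair]]:
    """Build (new language, k) pseudo-parallel pairs for each existing language k.

    Assumption: the expert translates the English side of each
    (new, eng) pair into k, and the result is paired with the
    new-language sentence.
    """
    corpora: Dict[str, List[Pair]] = {}
    for k in existing_langs:
        corpora[k] = [
            (new_sent, expert_translate(eng_sent, "eng", k))
            for new_sent, eng_sent in new_eng_pairs
        ]
    return corpora


def toy_expert(text: str, src: str, tgt: str) -> str:
    # Hypothetical stand-in for a real expert model: it only tags
    # the input with the target language code.
    return f"[{tgt}] {text}"


pairs = [("sentence in new language", "hello world")]
pseudo = build_pseudo_parallel(pairs, toy_expert, ["deu", "fra"])
```

In practice `toy_expert` would be replaced by a call to the frozen expert MNMT model. The sketch covers only data generation, not the language weighting or the imitation losses.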

Paper Structure (32 sections, 9 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Imit-MNMT consists of three parts. At each time step $t$, the expert emits three actions corresponding to parallel data between $(\ell_{new}, \ell_{eng})$, $(\ell_{eng}, \ell_{k})$, and $(\ell_{new}, \ell_{k})$. Then, the data distribution for $\ell_{new}$ in the learner model is imitated using a language weighting strategy. Finally, the translation behavior is imitated through two loss functions ($\mathcal{L}_{obs}$ and $\mathcal{L}_{imit}$), where $\mathcal{L}_{obs}$ is designed to mitigate catastrophic forgetting.
  • Figure 2: Corpus size analysis: the four languages on the left are those used in the main experiment with corpus sizes in the tens of millions, while the latter four are comparison languages with corpus sizes in the hundreds of thousands.
  • Figure 3: BLEU score distribution statistics: the original model, the extended model for translations from the original languages to the new language, and the extended model for translations from the new language to the original languages.
  • Figure 4: Model size analysis: a comparison of the performance between the 418M model and the 1.2B model (used in main experiments) across three language categories.
  • Figure 5: k-value analysis: performance comparison between k values of 5 and 10.