The Missing Ingredient in Zero-Shot Neural Machine Translation

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, Wolfgang Macherey

TL;DR

This work identifies the failure of purely parameter-sharing multilingual NMT to generalize to zero-shot language pairs. It reframes zero-shot translation as a domain adaptation problem and introduces two encoder-focused regularizers to enforce language-invariant latent representations, using English as a pivot. The proposed distribution-based and instance-based alignment methods substantially close the gap to pivoting without sacrificing supervised performance and scale effectively to more languages. The results on WMT14 and IWSLT17 demonstrate that explicit latent alignment can be a simple, scalable, and effective ingredient for improving zero-shot translation quality in multilingual models.

Abstract

Multilingual Neural Machine Translation (NMT) models are capable of translating between multiple source and target languages. Despite various approaches to train such models, they have difficulty with zero-shot translation: translating between language pairs that were not seen together during training. In this paper we first diagnose why state-of-the-art multilingual NMT models that rely purely on parameter sharing fail to generalize to unseen language pairs. We then propose auxiliary losses on the NMT encoder that impose representational invariance across languages. Our simple approach vastly improves zero-shot translation quality without regressing on supervised directions. For the first time, on WMT14 English-French-German, we achieve zero-shot performance that is on par with pivoting. We also demonstrate the easy scalability of our approach to multiple languages on the IWSLT 2017 shared task.

Paper Structure

This paper contains 23 sections, 3 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The proposed multilingual NMT model along with alignment. $x$ and $y$ are a pair of translations sampled from the available data, $D_{X,Y}$. One of $x$ or $y$ is always English. $z_x$ and $z_y$ are the encoder representations of $x$ and $y$, respectively. $\tilde{y}$ is the decoder prediction. $CE$ is the standard cross-entropy loss associated with maximum likelihood training for NMT. $\Omega$ is the alignment loss. Both losses, $CE$ and $\Omega$, are minimized simultaneously.
  • Figure 2: Average cosine distance between aligned context vectors for all combinations of English (en), German (de) and French (fr) as training progresses.
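
The two captions above describe a cross-entropy loss minimized together with an alignment loss on the encoder representations, and a cosine distance between aligned vectors. The following is a minimal sketch of how such a joint objective could be computed, not the authors' implementation. Several things here are assumptions that the captions do not specify: the max-pooling over time used to get one vector per sentence, the use of cosine distance as the alignment loss, the scalar `weight` that balances the two terms, and all function names (`pool`, `cosine_distance`, `cross_entropy`, `joint_loss`).

```python
import numpy as np


def pool(z):
    """Collapse encoder states of shape (T, d) to one vector of shape (d,).

    Max-pooling over time is an assumption made for this sketch.
    """
    return z.max(axis=0)


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity between two vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v) + eps
    return 1.0 - float(np.dot(u, v) / denom)


def cross_entropy(logits, targets):
    """Mean token-level cross-entropy.

    logits: (T, V) unnormalized scores; targets: (T,) integer token ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def joint_loss(logits, targets, z_x, z_y, weight=1.0):
    """Combine CE with an alignment term on the pooled encoder outputs.

    `weight` is a hypothetical hyperparameter balancing the two losses.
    Returns (total, ce, omega).
    """
    ce = cross_entropy(logits, targets)
    omega = cosine_distance(pool(z_x), pool(z_y))
    return ce + weight * omega, ce, omega


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_x = rng.normal(size=(5, 8))   # encoder states for x (5 tokens)
    z_y = rng.normal(size=(7, 8))   # encoder states for y (7 tokens)
    logits = rng.normal(size=(7, 20))
    targets = rng.integers(0, 20, size=7)
    total, ce, omega = joint_loss(logits, targets, z_x, z_y, weight=0.5)
    print(f"CE={ce:.3f}  Omega={omega:.3f}  total={total:.3f}")
```

Because the sentences in a translation pair generally differ in length, some pooling step is needed before the two encoder outputs can be compared; the pooled vectors always have the same dimension regardless of the number of tokens.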