The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation

Nicolas Guerin, Shane Steinert-Threlkeld, Emmanuel Chemla

TL;DR

Even a crude semantic signal (similar lexical fields across languages) is found to improve the alignment of two languages through back-translation, and it is conjectured that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation.

Abstract

Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show however that even crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is another proof that languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The components of a classical translation pipeline. Translation functions $T_{12}$ and $T_{21}$ are obtained from the composition of encoders and decoders from different languages.
  • Figure 2: Schematic representation of three cases in which the back-translation objective (\ref{eq:loss_bt}) is fully met: (a) with an accurate translation, (b) with a translation that misses the target language entirely but is mapped back onto the original language appropriately, (c) with a translation that bijectively shuffles the target language, and a reverse translation that unshuffles it back into place. We ignore here the hidden encoding space, and write $\longrightarrow$ for $T_{12}$ and $\longleftarrow$ for $T_{21}$.
  • Figure 3: The red (resp. blue) line represents the accuracy (resp. recall) of word translation for the words of the lexicon, ranked by their frequency in the training corpus. The dashed lines represent word counts in the test set (light blue) and in the model translation (grey). The model tracks actual frequencies well. Yet very few words have a non-zero score, which is consistent with a very low BLEU score. More surprisingly, the most frequent words are not translated any better.
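
The three cases in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's model or its loss: it assumes hypothetical four-word lexicons, replaces $T_{12}$ and $T_{21}$ with word-by-word lookup tables, and uses an exact-match round-trip check as a stand-in for the back-translation objective. Under those assumptions, all three cases reconstruct the source perfectly, yet only case (a) translates correctly.

```python
import random

# Hypothetical toy lexicons (not from the paper). A word-by-word lookup
# table stands in for the translation functions T12 and T21.
L1 = ["a", "b", "c", "d"]
L2 = ["A", "B", "C", "D"]
GOLD = dict(zip(L1, L2))  # assumed correct word-level translation


def invert(mapping):
    """Reverse a bijective lookup table (plays the role of T21)."""
    return {v: k for k, v in mapping.items()}


def translate(mapping, sentence):
    return [mapping[w] for w in sentence]


def round_trip_error(t12, t21, corpus):
    """Fraction of sentences x for which T21(T12(x)) != x."""
    failures = sum(translate(t21, translate(t12, s)) != s for s in corpus)
    return failures / len(corpus)


def translation_accuracy(t12, corpus):
    """Fraction of sentences whose translation matches the gold one."""
    correct = sum(translate(t12, s) == translate(GOLD, s) for s in corpus)
    return correct / len(corpus)


rng = random.Random(0)
corpus = [[rng.choice(L1) for _ in range(5)] for _ in range(100)]

cases = {
    # (a) accurate translation
    "a": dict(GOLD),
    # (b) outputs that miss the target language entirely
    "b": {w: f"<{i}>" for i, w in enumerate(L1)},
    # (c) bijective shuffle of the target lexicon (rotation by one)
    "c": {w: L2[(i + 1) % len(L2)] for i, w in enumerate(L1)},
}

# For each case: (round-trip error, translation accuracy).
results = {
    name: (round_trip_error(t12, invert(t12), corpus),
           translation_accuracy(t12, corpus))
    for name, t12 in cases.items()
}

for name, (err, acc) in results.items():
    print(f"case ({name}): round-trip error={err:.2f}, accuracy={acc:.2f}")
```

In this sketch the round-trip check alone cannot tell the three cases apart, which is the sense in which the caption says the objective is "fully met" even by degenerate solutions.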