Many-to-English Machine Translation Tools, Data, and Pretrained Models

Thamme Gowda, Zhao Zhang, Chris A Mattmann, Jonathan May

TL;DR

This work creates a multilingual neural machine translation model capable of translating from 500 source languages to English, readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.

Abstract

While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make their models available for transfer to low-resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.

Paper Structure

This paper contains 15 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Training data statistics for the 500 languages, sorted in descending order of English token count. These statistics are obtained after de-duplication and filtering (see the datasets section). The full names for these ISO 639-3 codes can be looked up using MTData, e.g., mtdata-iso eng.
  • Figure 2: Many-to-English BLEU on the OPUS-100 tests (Zhang et al., 2020). Despite having four times more languages on the source side, our model scores BLEU competitive with the strongest system of Zhang et al. (2020) on most languages. The tests where our model scores lower BLEU have shorter source sentences (mean length of about three tokens).
  • Figure 3: RTG Web Interface