Omnilingual MT: Machine Translation for 1,600 Languages

Omnilingual MT Team; Belen Alastruey; Niyati Bafna; Andrea Caciolai; Kevin Heffernan; Artyom Kozhevnikov; Christophe Ropers; Eduardo Sánchez; Charles-Eric Saint-James; Ioannis Tsiamas; Chierh Cheng; Joe Chuang; Paul-Ambroise Duquenne; Mark Duppenthaler; Nate Ekberg; Cynthia Gao; Pere Lluís Huguet Cabot; João Maria Janeiro; Jean Maillard; Gabriel Mejia Gonzalez; Holger Schwenk; Edan Toledo; Arina Turkatenko; Albert Ventayol-Boada; Rashel Moritz; Alexandre Mourachko; Surya Parimi; Mary Williamson; Shireen Yates; David Dale; Marta R. Costa-jussà

Abstract

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEy bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and are freely available.

Paper Structure (36 sections, 3 figures, 6 tables)

Introduction
Expanding Machine Translation
Languages
Referring to languages
Quality translations from or into underserved languages
Determining pivot languages
Providing contextual information
Resource levels
Describing languages in prompts
Creating High-Quality Datasets
Main CPT Training Data Collection
Monolingual Datasets
Bible texts
Panlex
Tatoeba
...and 21 more sections

Figures (3)

Figure 3.1: Correlation between translation quality (OMT-LLaMA model, Bible benchmark of 1,560 languages, mean xCOMET score) and amount of parallel documents from primary sources (not mined or synthetic). We fit an isotonic regression to show the global trend.
Figure 3.2: Graph with distribution of languages per resource bucket. Note that we count all languages for which we have some data (including monolingual data and word-level parallel data like Panlex), but the buckets are determined based on the parallel data that is at least (and predominantly) sentence-level.
Figure 4.1: Steps in the creation of MeDLEy-source and MeDLEy-109. This includes (1) enumeration of grammatical features, (2) template generation including domain and source language assignment, (3) manual creation of paragraphs in 5 source languages: English, Mandarin, Spanish, Russian, and German, and (4) n-way parallelization (via English) across 8 pivot languages: English, Mandarin, Spanish, Russian, Hindi, Indonesian, Swahili, and French, resulting in MeDLEy-source. This is then (5) translated into 109 low-resource languages, each from a convenient pivot depending on the translator, resulting in MeDLEy-109.
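
The caption of Figure 3.1 mentions fitting an isotonic regression to show the global trend. As a rough illustration of what such a fit does, here is a minimal sketch using the pool-adjacent-violators algorithm. The function name `isotonic_fit` and the toy numbers are hypothetical and not taken from the paper, and the authors' actual fitting code may differ (for example, in how tied x values or log-scaled axes are handled).

```python
def isotonic_fit(xs, ys):
    """Least-squares non-decreasing fit of ys against xs (pool-adjacent-violators).

    Simplification: tied x values are kept in sort order rather than averaged.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    # Each block holds [sum of y, count]; its fitted value is the block mean.
    blocks = []
    for i in order:
        blocks.append([ys[i], 1])
        # Merge backwards while the previous block's mean exceeds the newest one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand block means back to one value per point, in sorted-x order.
    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * count)
    # Restore the caller's original ordering.
    fitted = [0.0] * len(xs)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted


# Hypothetical toy data: x = amount of parallel documents, y = mean quality score.
docs = [1, 2, 3, 4, 5]
scores = [0.2, 0.5, 0.4, 0.6, 0.9]
trend = isotonic_fit(docs, scores)
```

The fit replaces any locally decreasing run (here the second and third points) with its average, so the resulting curve never decreases as the amount of data grows, without assuming any particular functional form.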
