X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale

Haoran Xu, Kenton Murray, Philipp Koehn, Hieu Hoang, Akiko Eriguchi, Huda Khayrallah

TL;DR

X-ALMA introduces a plug-and-play, language-grouped architecture with language-specific (LS) modules to deliver uniformly high translation quality across 50 languages. It couples this architecture with a five-stage training recipe and a novel Adaptive Rejection Preference Optimization (ARPO) method that addresses over-rejection in MT preference learning, achieving higher COMET-22 scores on FLORES-200 and WMT'23 than open multilingual models. On XCOMET-XL, X-ALMA matches or exceeds the baselines in 97 of 98 directions, and the authors release preference data and checkpoints to support reproducibility. This work advances multilingual translation by combining modular design with targeted preference optimization to mitigate multilinguality trade-offs and resource disparities across languages.

Abstract

Large language models (LLMs) have achieved remarkable success across various NLP tasks, though with a focus on English due to English-centric pre-training and limited multilingual data. In this work, we focus on the problem of translation. While some multilingual LLMs claim to support hundreds of languages, they often fail to provide high-quality responses for mid- and low-resource languages, leading to imbalanced performance heavily skewed in favor of high-resource languages. We introduce **X-ALMA**, a model designed to ensure top-tier performance across 50 diverse languages, regardless of their resource levels. X-ALMA surpasses state-of-the-art open-source multilingual LLMs, such as Aya-101 and Aya-23, in every single translation direction on the FLORES-200 and WMT'23 test datasets according to COMET-22. This is achieved by a plug-and-play language-specific module architecture that prevents language conflicts during training and a carefully designed training regimen with novel optimization methods that maximize translation performance. At the final stage of the training regimen, our proposed **A**daptive **R**ejection **P**reference **O**ptimization (**ARPO**) surpasses existing preference optimization methods in translation tasks.

Paper Structure (32 sections, 6 equations, 7 figures, 14 tables)

This paper contains 32 sections, 6 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Depiction of the general inverse trend between the number of supported languages and average translation performance. While many state-of-the-art multilingual models claim to support hundreds of languages, the translation quality is not as high as in models trained on fewer languages, particularly for mid- and low-resource languages. This is reflected in the trend of decreasing average scores as more languages are supported. In contrast, we propose X-ALMA, which extends ALMA-R by supporting 44 additional diverse languages with even higher average performance, offering top performance across all supported languages, regardless of resource level.
  • Figure 2: High-level architecture design of the plug-and-play multilingual model. Each language group is assigned a specific module that works alongside the base model. These language-specific modules handle inputs exclusively from their respective language groups, enabling the model to effectively adapt to different linguistic characteristics while leveraging the shared base model for comprehensive multilingual learning.
  • Figure 3: Left: ablation study on each stage of the training recipe, demonstrating that adding each stage leads to consistent performance improvements. Right: ablation study on the impact of parallel data composition during the SFT stage. Adding WMT data to NTREX significantly enhances model performance, while adding Flores-200 data provides no noticeable improvement.
  • Figure 4: Diagram of the multi-stage process of fine-tuning a multilingual model. In Pre-Training Stage 1, the base model is fine-tuned using 20B tokens of monolingual data from 50 languages. The process continues with Pre-Training Stage 2, where language-specific modules are fine-tuned with 10B monolingual tokens. Pre-Training Stage 3 introduces pseudo-monolingual fine-tuning, using randomly concatenated parallel sentences to improve multilingual alignment. The model then undergoes Post-Training Stage 1, where SFT is performed on high-quality parallel data, followed by Post-Training Stage 2, which applies Adaptive Contrastive Preference Optimization to address over-rejection issues in translation preference learning.
  • Figure 5: Cumulative distribution of reward differences between machine translation and open-ended question answering tasks in contrastive preference optimization.
  • ...and 2 more figures
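
The routing idea in the Figure 2 caption (each language group has its own module, used only for inputs from that group, alongside a shared base model) can be illustrated with a toy sketch. This is not the authors' implementation. The group names, the language-to-group assignment, the additive combination of base and module outputs, and the names `group_for`, `PlugAndPlayModel`, and `toy_loader` are all assumptions made for illustration, and plain floats stand in for model activations.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical groups for illustration; the source does not enumerate them.
LANGUAGE_GROUPS: Dict[str, FrozenSet[str]] = {
    "group_a": frozenset({"de", "nl", "sv"}),
    "group_b": frozenset({"ja", "ko", "zh"}),
}


def group_for(lang: str) -> str:
    """Return the name of the language group that contains `lang`."""
    for group, langs in LANGUAGE_GROUPS.items():
        if lang in langs:
            return group
    raise KeyError(f"unsupported language: {lang}")


class PlugAndPlayModel:
    """Shared base plus lazily loaded, group-specific modules (toy version)."""

    def __init__(
        self,
        base: Callable[[float], float],
        load_module: Callable[[str], Callable[[float], float]],
    ) -> None:
        self.base = base
        self.load_module = load_module
        self.loaded: Dict[str, Callable[[float], float]] = {}

    def forward(self, x: float, lang: str) -> float:
        group = group_for(lang)
        # Plug in the module for this group only when it is first needed.
        if group not in self.loaded:
            self.loaded[group] = self.load_module(group)
        # Assumption: the module's contribution is added to the base output.
        return self.base(x) + self.loaded[group](x)


def toy_loader(group: str) -> Callable[[float], float]:
    """Hypothetical loader: each group's module adds a fixed offset."""
    offset = {"group_a": 1.0, "group_b": -1.0}[group]
    return lambda x: offset


model = PlugAndPlayModel(base=lambda x: 2.0 * x, load_module=toy_loader)
out_de = model.forward(3.0, "de")  # base(3.0) + group_a module
```

After the single call above, only the module for the group containing `"de"` has been loaded. That is the practical appeal of a plug-and-play layout: modules for unused language groups never need to be in memory.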