Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang; Pengzhi Gao; Wei Liu; Jian Luan; Jinsong Su

TL;DR

This work systematically analyzes how model and data scaling affect adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning, spanning 46 languages. Building on Gemma3, it introduces MiLMMT-46, a set of open multilingual MT models that outperform open baselines and rival proprietary systems. The study reveals that increasing pretraining data yields reliable gains across model sizes, while larger models show data-efficient benefits during instruction finetuning, with diminishing returns in some metrics. Public releases of MiLMMT-46 and the scaling insights support scalable, transparent multilingual translation with open LLMs, and the findings motivate future work in RL-based alignment and larger-scale exploration.

Abstract

Open large language models (LLMs) have shown increasingly strong multilingual capabilities in recent years. In this paper, we study open LLMs for multilingual machine translation (MT) across a range of languages and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and is competitive with strong proprietary systems such as Google Translate and Gemini 3 Pro.

Paper Structure (18 sections, 1 equation, 6 figures, 24 tables)

Introduction
Related Work
Datasets and Baseline Settings
Datasets
Models
Evaluation
Benchmarking Open LLMs for Multilingual Machine Translation
Tokenizer Efficiency
In-context Multilingual Translation Performance with Open LLMs
Model and Data Scaling for Multilingual MT with Open LLMs
Pretraining Data
Monolingual Data
Parallel Data
Supervised Finetuning Data
Exploring Model and Data Scaling for Multilingual Translation with LLMs
...and 3 more sections

Figures (6)

Figure 1: The tokenizer efficiency of open-source LLMs for each non-English language. The smaller the length ratio, the more efficient the tokenizer. The detailed results are summarized in Table \ref{tab:tokenization}.
Figure 2: The translation performance (COMET) of different models trained with different $n$ during the continual pretraining stage. The translation performance in BLEU scores is illustrated in Figure \ref{fig:scaling_cpt_bleu}.
Figure 3: The translation performance (COMET) of different models trained with varying numbers of sentence pairs during the instruction finetuning stage. Translation performance measured by BLEU is shown in Figure \ref{fig:scaling_sft_bleu}.
Figure 4: Number of sentence pairs for simplified Chinese-centric and English-centric parallel datasets.
Figure 5: The translation performance (spBLEU) of different models trained with different $n$ during the continual pretraining stage.
...and 1 more figures
