Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su

TL;DR

This work systematically analyzes how model and data scaling affect adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning, spanning 46 languages. Building on Gemma3, it introduces MiLMMT-46, a set of open multilingual MT models that outperform open baselines and rival proprietary systems. The study reveals that increasing pretraining data yields reliable gains across model sizes, while larger models show data-efficient benefits during instruction finetuning, with diminishing returns in some metrics. Public releases of MiLMMT-46 and the scaling insights support scalable, transparent multilingual translation with open LLMs, and the findings motivate future work in RL-based alignment and larger-scale exploration.

Abstract

Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.

Paper Structure (18 sections, 1 equation, 6 figures, 24 tables)

Figures (6)

  • Figure 1: The tokenizer efficiency of open-source LLMs for each non-English language. A smaller length ratio indicates a more efficient tokenizer. The detailed results are summarized in Table \ref{tab:tokenization}.
  • Figure 2: The translation performance (COMET) of different models trained with different $n$ during the continual pretraining stage. The translation performance in BLEU scores is illustrated in Figure \ref{fig:scaling_cpt_bleu}.
  • Figure 3: The translation performance (COMET) of different models trained with varying numbers of sentence pairs during the instruction finetuning stage. Translation performance measured by BLEU is shown in Figure \ref{fig:scaling_sft_bleu}.
  • Figure 4: Number of sentence pairs for simplified Chinese-centric and English-centric parallel datasets.
  • Figure 5: The translation performance (spBLEU) of different models trained with different $n$ during the continual pretraining stage.
  • ...and 1 more figure