Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su
TL;DR
This work systematically analyzes how model scaling and data scaling affect the adaptation of open LLMs to multilingual MT through continual pretraining and instruction finetuning, covering 46 languages. Building on Gemma3, it introduces MiLMMT-46, a set of open multilingual MT models that outperform open baselines and rival proprietary systems. The study finds that increasing pretraining data yields reliable gains across model sizes, while larger models are more data-efficient during instruction finetuning, with diminishing returns on some metrics. The public release of MiLMMT-46, together with the scaling insights, supports scalable, transparent multilingual translation with open LLMs, and the findings motivate future work on RL-based alignment and larger-scale exploration.
Abstract
The multilingual capabilities of open large language models (LLMs) have improved in recent years. In this paper, we study open LLMs for multilingual machine translation (MT) across 46 languages and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across these 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and performs competitively with strong proprietary systems such as Google Translate and Gemini 3 Pro.
