Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study

Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang

TL;DR

The study benchmarks open LLMs under 10B parameters on multilingual translation across 28 languages, revealing strong capabilities but gaps relative to closed models. It introduces the Parallel-First Monolingual-Second (PFMS) data mixing strategy and demonstrates that continual pretraining of Gemma2-9B with PFMS, followed by translation-focused finetuning, yields GemmaX2-28-9B, an open-model translator with translation quality rivaling GPT-4-turbo and Google Translate on many directions. The work provides a practical data recipe and releases GemmaX2-28-9B, highlighting the potential of open LLMs in scalable multilingual MT while outlining limitations and avenues for future scale-up and language expansion.

Abstract

Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance gains. In this paper, we systematically explore the abilities of open LLMs with fewer than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA and achieves competitive performance with Google Translate and GPT-4-turbo.

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The tokenizer efficiency of open-source LLMs for each non-English language. The smaller the length ratio is, the more efficient the tokenizer is.
  • Figure 2: MT performance on the FLORES-200 benchmark with different numbers of in-context exemplars.
  • Figure 3: Number of sentences in different languages for the Chinese-centric and English-centric parallel datasets.
  • Figure 4: The translation performance (COMET) of models trained with different data recipes during continual pretraining on low-resource (left), mid-resource (middle), and high-resource (right) languages. The upper subfigures illustrate the en$\rightarrow$xx translation performance, while the lower subfigures depict the xx$\rightarrow$en translation performance. Note that "Gemma2-9B" refers to the direct finetuning of the model without continual pretraining, and its performance is reflected in the right-hand y-axis. The translation performance in BLEU scores is illustrated in Figure 5.
  • Figure 5: The translation performance (BLEU) of models trained with different data recipes during continual pretraining on low-resource (left), mid-resource (middle), and high-resource (right) languages. The upper subfigures illustrate the en$\rightarrow$xx translation performance, while the lower subfigures depict the xx$\rightarrow$en translation performance. Note that "Gemma2-9B" refers to the direct finetuning of the model without continual pretraining, and its performance is reflected in the right-hand y-axis.
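
The length ratio mentioned in the Figure 1 caption can be illustrated with a minimal sketch. The caption does not spell out the exact definition, so the one used here is an assumption: the total number of tokens on the non-English side of a parallel corpus divided by the total number of tokens on the English side. The function name `length_ratio` and the whitespace tokenizer are hypothetical stand-ins; in practice the `tokenize` argument would wrap the tokenizer of the LLM under study.

```python
from typing import Callable, Iterable, List, Tuple


def length_ratio(
    pairs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Tokens on the non-English side divided by tokens on the English side.

    Assumed definition (not confirmed by the caption): counts are summed over
    the whole parallel corpus before dividing. A smaller value means the
    tokenizer needs fewer tokens for the non-English text.
    """
    non_english_tokens = 0
    english_tokens = 0
    for non_english, english in pairs:
        non_english_tokens += len(tokenize(non_english))
        english_tokens += len(tokenize(english))
    if english_tokens == 0:
        raise ValueError("the English side contains no tokens")
    return non_english_tokens / english_tokens


def whitespace_tokenize(text: str) -> List[str]:
    """Toy tokenizer used only to keep the example self-contained."""
    return text.split()


# Hypothetical (non-English, English) sentence pairs for illustration.
example_pairs = [
    ("Guten Morgen", "Good morning"),
    ("Wie geht es dir", "How are you"),
]

ratio = length_ratio(example_pairs, whitespace_tokenize)
print(f"length ratio: {ratio:.2f}")
```

Summing token counts over the corpus before dividing keeps very short sentences from dominating the result; averaging per-sentence ratios would be an equally plausible reading of the caption.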