FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Shaolin Zhu, Tianyu Dong, Bo Li, Deyi Xiong

TL;DR

FuxiMT addresses the limited coverage of Chinese-centric multilingual MT by building on a BLOOMz LLM sparsified with Mixture-of-Experts (MoE) layers and trained in two stages. The first stage is Chinese-centric pre-training on 5B tokens; the second is multilingual fine-tuning on over 100B parallel sentences across 65 languages, using curriculum learning and back-translation. The model achieves substantial gains over strong baselines, particularly in low-resource and zero-shot settings, which indicates effective cross-lingual transfer, while the MoE layers keep computation efficient. The work suggests a viable way to bridge linguistic gaps: keep the backbone frozen and route inputs through specialized experts across a broad set of languages.

Abstract

In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.
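
The abstract does not say how FuxiMT's Mixture-of-Experts layers select experts, so the following is only a minimal sketch of generic top-k gating, one common way to sparsify a feed-forward layer. The function `route_top_k`, the toy experts, the gate weights, and the choice of `k=2` are all illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, gate_weights, experts, k=2):
    """Send one token vector through its k highest-scoring experts (hypothetical helper)."""
    # Gate score per expert: dot product of the token with that expert's gate vector.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    # Keep only the k most probable experts; skipping the rest is the source of sparsity.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(token)
    for i in top:
        weight = probs[i] / norm  # renormalize over the selected experts
        expert_out = experts[i](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, top

def make_expert(scale):
    # Toy expert (assumption): scales its input by a constant.
    return lambda vec: [scale * v for v in vec]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, chosen = route_top_k([0.5, 2.0], gate_weights, experts, k=2)
print(chosen, output)
```

Only the selected experts are evaluated for each token, so per-token compute grows with `k` rather than with the total number of experts. This is the usual efficiency argument for MoE layers.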

Paper Structure

This paper contains 29 sections, 5 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Language and data distribution in the pre-training data of FuxiMT.
  • Figure 2: Diagram of FuxiMT. FuxiMT is built upon BLOOMz-7B and fine-tuned on translation and general tasks.
  • Figure 3: Results on zero-shot language pairs.