Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang

TL;DR

Marco-LLM addresses the multilingual performance gap in large language models by building a massive multilingual LLM through two-stage continual pretraining on the Qwen2 base, then applying extensive multilingual post-training that includes supervised fine-tuning and preference alignment. The approach combines diverse data sources (web, parallel, knowledge, synthetic) with careful filtering and a staged learning-rate strategy to balance adaptation and forgetting across languages, especially low-resource ones. Evaluations across MMMLU, Flores, Belebele, CEVAL, TyDiQA, and other benchmarks show Marco-LLM achieving strong multilingual performance, often surpassing strong open-source baselines and maintaining robust English/Chinese capabilities. The results demonstrate the practical potential of targeted multilingual data curation and staged training to extend LLM capabilities across a broad language spectrum, paving the way for more inclusive, cross-lingual AI systems.

Abstract

Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially in low-resource languages. To address this issue, we introduce Marco-LLM, a massive multilingual LLM trained for cross-lingual enhancement. We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training on the Qwen2 models. In comprehensive evaluations on multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM demonstrates substantial improvements over state-of-the-art LLMs. Marco-LLM also achieves substantial gains in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well in multilingual tasks, including low-resource languages, but also to maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.

Paper Structure

This paper contains 60 sections, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Comparison of English-centric performance vs Multilingual performance on MMMLU and Flores. Our Marco-LLM demonstrates strong performance on both dimensions.
  • Figure 2: An overview of the training and evaluation paradigm of Marco-LLM. We conduct massive multilingual continual pre-training, multilingual supervised fine-tuning, and preference alignment, then perform extensive evaluation on multilingual benchmarks to validate the efficacy of Marco-LLM.
  • Figure 3: The amount of tokens per category in our multilingual continual pretraining corpus for Marco-LLM.
  • Figure 4: Evolution of performance on question answering and machine translation during continual pretraining in Marco-1.5B.
  • Figure 5: Performance of different model sizes on the Flores benchmark. Marco-w/o-parallel-data-filtering denotes Marco-LLM continually pre-trained from Qwen2 without applying any filtering to the parallel data.
  • ...and 3 more figures