EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Shaoxiong Ji, Zihao Li, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow

TL;DR

EMMA-500 demonstrates that large-scale continual pre-training (CPT) on an expansive, diverse multilingual corpus can meaningfully expand language coverage and cross-lingual transfer, especially for low-resource languages. The MaLA corpus provides a scalable, multi-domain foundation (code, scientific text, books, and instructions) with robust preprocessing (normalization, script handling, deduplication) to support effective CPT of Llama 2 7B. Across intrinsic evaluation and a wide range of downstream multilingual benchmarks, EMMA-500 achieves competitive or leading performance in translation, classification, commonsense reasoning, NLI, and more, and the datasets, weights, and tooling are publicly released. Limitations remain in math, machine reading comprehension, and some high-resource languages, motivating future work with newer base models, targeted task tuning, and enhanced evaluation frameworks for multilingual safety and alignment.

Abstract

In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages and designed for enhanced multilingual performance, with a focus on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.

Paper Structure (68 sections, 4 figures, 25 tables)

Figures (4)

  • Figure 1: The number of wins, i.e., the number of times EMMA-500 achieves the best or superior performance compared to other models in the same category across various evaluation tasks and benchmarks. We compare our EMMA-500 Llama 2 7B model to decoder-only LLMs of similar parameter size, including (i) 10 Llama 2-based LLMs, (ii) 7 multilingual LLMs and CPT models, and (iii) 8 recent advanced LLMs (see the baselines section) on the tasks and benchmarks listed in the downstream-tasks table. If EMMA-500 scores higher than all compared models on a specific benchmark, it is considered a winning case for that particular evaluation. Our EMMA-500 Llama 2 model outperforms most Llama 2-based, multilingual LLMs and CPT models. Remarkably, our model achieves the best performance on Flores200, Glot500-c, and PBC among all the compared baselines.
  • Figure 2: The number of texts and tokens of MaLA corpus in different resource groups.
  • Figure 3: Unicode block distribution, measured as the percentage of token counts falling into the Unicode block of each language.
  • Figure 5: Writing tasks in the PolyWrite dataset.
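
The win-counting rule in the Figure 1 caption (a benchmark counts as a win when the model scores higher than every compared model) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `count_wins`, the nested-dictionary layout, and all model names, benchmark names, and scores are hypothetical. It also assumes that a higher score is better and that ties do not count as wins; metrics where lower is better would need to be negated first.

```python
from typing import Dict


def count_wins(scores: Dict[str, Dict[str, float]], target: str) -> int:
    """Count benchmarks where `target` scores strictly higher than every other model.

    `scores` maps benchmark name -> {model name -> score}, higher is better.
    """
    wins = 0
    for by_model in scores.values():
        if target not in by_model:
            continue  # the target was not evaluated on this benchmark
        rivals = [s for model, s in by_model.items() if model != target]
        # A win requires at least one rival and a strictly higher score.
        if rivals and by_model[target] > max(rivals):
            wins += 1
    return wins


# Made-up numbers for illustration only; these are not results from the paper.
example_scores = {
    "benchmark_a": {"target_model": 0.61, "baseline_1": 0.55, "baseline_2": 0.58},
    "benchmark_b": {"target_model": 0.40, "baseline_1": 0.47, "baseline_2": 0.39},
    "benchmark_c": {"target_model": 0.72, "baseline_1": 0.72, "baseline_2": 0.60},
}

print(count_wins(example_scores, "target_model"))  # 1
```

To mirror the per-category comparison in the caption, the same function could be called once per baseline group, with `scores` restricted to the target model plus the models in that group.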