EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
Shaoxiong Ji, Zihao Li, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow
TL;DR
EMMA-500 demonstrates that large-scale continual pre-training (CPT) on an expansive, diverse multilingual corpus can meaningfully expand language coverage and improve cross-lingual transfer, especially for low-resource languages. The MaLA corpus provides a scalable, multi-domain foundation (code, scientific text, books, and instructions) with robust preprocessing (normalization, script handling, deduplication) to support effective CPT on Llama 2 7B. In intrinsic evaluation and across a wide range of downstream multilingual benchmarks, EMMA-500 achieves competitive or leading performance in translation, classification, commonsense reasoning, NLI, and more. The datasets, weights, and tooling are released openly. Limitations remain in math and machine reading comprehension, along with gaps in some high-resource languages, motivating future work with newer base models, targeted task tuning, and enhanced evaluation frameworks for multilingual safety and alignment.
Abstract
In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts in 546 languages. It is designed for enhanced multilingual performance, with a focus on improving coverage of low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, and show significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.
