MaLA-500: Massive Language Adaptation of Large Language Models

Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze

TL;DR

MaLA-500 addresses the limited language coverage of English-centric LLMs by adapting LLaMA 2 on Glot500-c data to support 534 languages. The approach combines vocabulary extension with continued pretraining via LoRA, yielding a 10B model that performs well on low-resource languages and improves in-context learning on SIB200 and Taxi1500. The authors compare MaLA-500 against several open LLMs, reporting lower NLL on the Glot500-c test set and higher accuracy across languages from diverse families, and release the model and code for reproducibility. The work broadens practical access to LLMs for underrepresented languages and lays a foundation for future translation and cross-lingual tasks.
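To make the LoRA part of the recipe concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is not the authors' training code: the function names, the tiny matrix sizes, the initialization scale, and the `alpha` and `rank` values are illustrative assumptions. It shows the standard formulation, in which the frozen base weight `W` is combined with a trainable low-rank update `(alpha / r) * B @ A`, and `B` starts at zero so the adapted layer initially matches the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A is small random, B is all zeros (standard LoRA init)."""
    rng = random.Random(seed)
    lora_a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(d_out)]
    return lora_a, lora_b


def lora_forward(base_weight, lora_a, lora_b, x, alpha):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)
    scale = alpha / rank
    base_out = matvec(base_weight, x)
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base_out, low_rank_out)]


def trainable_fraction(d_out, d_in, rank):
    """Ratio of LoRA parameters to full-matrix parameters for one layer."""
    return rank * (d_in + d_out) / (d_out * d_in)


# Hypothetical toy layer: 3 inputs, 2 outputs, rank-1 adapter.
base_weight = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
lora_a, lora_b = init_lora(d_out=2, d_in=3, rank=1)
x = [1.0, 1.0, 1.0]
initial_output = lora_forward(base_weight, lora_a, lora_b, x, alpha=2.0)
```

Because `B` is zero at initialization, `initial_output` equals the base layer's output; continued pretraining would then update only `A` and `B`. The `trainable_fraction` helper illustrates why this is cheap: for a hypothetical 4096 x 4096 weight with rank 8, the adapter holds well under 1% of the layer's parameters.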

Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages. We release MaLA-500 at https://huggingface.co/MaLA-LM
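The two evaluation quantities named in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names, language labels, and numbers are hypothetical, and the NLL here is the plain sum of negative log token probabilities (whether and how the paper normalizes it is not stated in this summary). Macro-averaging computes accuracy per language first and then averages over languages, so every language counts equally regardless of its test-set size.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a text: sum of -log p over its tokens."""
    return -sum(math.log(p) for p in token_probs)


def per_language_accuracy(results_by_language):
    """Map each language to the fraction of (gold, predicted) pairs that match."""
    return {
        lang: sum(1 for gold, pred in pairs if gold == pred) / len(pairs)
        for lang, pairs in results_by_language.items()
    }


def macro_average_accuracy(results_by_language):
    """Average of per-language accuracies; each language has equal weight."""
    accuracies = per_language_accuracy(results_by_language)
    return sum(accuracies.values()) / len(accuracies)


def micro_average_accuracy(results_by_language):
    """Accuracy over all examples pooled together; large languages dominate."""
    all_pairs = [pair for pairs in results_by_language.values() for pair in pairs]
    return sum(1 for gold, pred in all_pairs if gold == pred) / len(all_pairs)


# Hypothetical predictions for two made-up languages of different sizes.
results = {
    "lang_a": [("sports", "sports"), ("health", "health"),
               ("travel", "travel"), ("science", "science")],
    "lang_b": [("sports", "health"), ("travel", "travel")],
}
macro = macro_average_accuracy(results)   # (1.0 + 0.5) / 2
micro = micro_average_accuracy(results)   # 5 correct out of 6
nll = sequence_nll([0.5, 0.25])           # log(2) + log(4)
```

In this toy data the macro average (0.75) is lower than the micro average (5/6) because the smaller, harder language is weighted equally, which is why macro-averaging is a natural choice when many low-resource languages are evaluated.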

Paper Structure

This paper contains 21 sections, 5 figures, 29 tables.

Figures (5)

  • Figure 1: $NLL$ (lower is better) on Glot500-c test with the scores grouped into four bins displayed in different colors. X-axis: the number of languages in performance ranges.
  • Figure 2: Accuracy (higher is better) on SIB200 with the scores grouped into four bins displayed in different colors. X-axis: the number of languages in performance ranges (%).
  • Figure 3: Accuracy (higher is better) on Taxi1500 with the scores grouped into four bins displayed in different colors. X-axis: the number of languages in performance ranges (%).
  • Figure 4: In-context learning macro-average accuracy (%) on SIB200 with different numbers of shots using MaLA-500.
  • Figure 5: Detailed results of in-context learning on SIB200 using MaLA-500. X-axis: the number of languages in different accuracy ranges (%). Y-axis: number of shots.
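The figure captions above describe per-language scores grouped into four bins, with the number of languages counted in each range. A minimal sketch of that grouping is shown below; the bin edges, language labels, and scores are hypothetical placeholders, since the actual ranges used in the figures are not given in this summary.

```python
from bisect import bisect_right


def count_languages_per_bin(scores_by_language, inner_edges):
    """Count languages per bin; n inner edges define n + 1 half-open bins.

    A score equal to an edge falls into the bin that starts at that edge.
    """
    counts = [0] * (len(inner_edges) + 1)
    for score in scores_by_language.values():
        counts[bisect_right(inner_edges, score)] += 1
    return counts


def bin_labels(inner_edges):
    """Human-readable labels matching the bins from count_languages_per_bin."""
    labels = [f"< {inner_edges[0]}"]
    labels += [f"{lo} to < {hi}" for lo, hi in zip(inner_edges, inner_edges[1:])]
    labels.append(f">= {inner_edges[-1]}")
    return labels


# Hypothetical accuracy scores (%) for five made-up languages.
scores = {"lang_a": 10.0, "lang_b": 25.0, "lang_c": 60.0,
          "lang_d": 80.0, "lang_e": 90.0}
edges = [25, 50, 75]  # assumed edges giving four bins
counts = count_languages_per_bin(scores, edges)
for label, count in zip(bin_labels(edges), counts):
    print(f"{label}: {count} languages")
```

The same helper applies to NLL (Figure 1) by passing NLL values and suitable edges, keeping in mind that lower NLL is better while higher accuracy is better.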