What Drives Performance in Multilingual Language Models?

Sina Bagheri Nezhad, Ameeta Agrawal

TL;DR

This work investigates what drives performance in multilingual language models by evaluating six multilingual models, spanning masked, autoregressive, and instruction-tuned types, on the SIB-200 dataset, which covers 204 languages. Using decision-tree analysis across ALL, SEEN, and UNSEEN language groups, it shows that pretraining data size primarily governs SEEN-language performance, whereas script type and language family become the main predictors for UNSEEN languages; model size and architecture play only a minor role. The study considers four factors (pretraining data size, general resource availability, language family, and script) across 93 experimental settings to reveal consistent patterns in cross-lingual transfer and resource disparities. These findings have practical implications for developing more equitable multilingual NLP systems, favoring data-centric and linguistically informed strategies over model scaling alone. Overall, the work highlights the limits of model size as a sole driver of multilingual performance and points to data distribution and linguistic typology as key levers for broader language coverage.

Abstract

This study investigates the factors influencing the performance of multilingual large language models (MLLMs) across diverse languages. We study 6 MLLMs, including masked language models, autoregressive models, and instruction-tuned LLMs, on the SIB-200 dataset, a topic classification dataset encompassing 204 languages. Our analysis considers three scenarios: ALL languages, SEEN languages (present in the model's pretraining data), and UNSEEN languages (not present or documented in the model's pretraining data in any meaningful way). We examine the impact of factors such as pretraining data size, general resource availability, language family, and script type on model performance. Decision tree analysis reveals that pretraining data size is the most influential factor for SEEN languages. For UNSEEN languages, by contrast, script type and language family are the crucial factors, highlighting the importance of cross-lingual transfer learning. Notably, model size and architecture do not significantly alter the most important features identified. Our findings provide insights into the strengths and limitations of current MLLMs, and we hope they will guide the development of more effective and equitable multilingual NLP systems.
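
To make the decision-tree idea concrete, the following is a minimal sketch of how the root split of a regression tree picks the most informative feature: it tries every threshold on every feature and keeps the one that most reduces the variance of the F1 scores. This is an illustration only, not the authors' pipeline. The rows in `ROWS`, the feature names `pretrain_pct` and `resource_level`, and the helpers `best_split` and `rank_features` are all hypothetical, and the numbers are invented.

```python
from statistics import pvariance

# Hypothetical toy records, invented for illustration (not data from the paper).
ROWS = [
    {"pretrain_pct": 0.01, "resource_level": 1, "f1": 0.30},
    {"pretrain_pct": 0.02, "resource_level": 3, "f1": 0.32},
    {"pretrain_pct": 0.05, "resource_level": 0, "f1": 0.35},
    {"pretrain_pct": 0.10, "resource_level": 2, "f1": 0.33},
    {"pretrain_pct": 1.00, "resource_level": 1, "f1": 0.70},
    {"pretrain_pct": 2.00, "resource_level": 4, "f1": 0.72},
    {"pretrain_pct": 5.00, "resource_level": 3, "f1": 0.75},
    {"pretrain_pct": 10.0, "resource_level": 5, "f1": 0.80},
]


def best_split(rows, feature, target="f1"):
    """Return (threshold, variance_reduction) for the best binary split on one feature."""
    values = sorted({r[feature] for r in rows})
    total = pvariance([r[target] for r in rows])
    best = (None, 0.0)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r[target] for r in rows if r[feature] <= thr]
        right = [r[target] for r in rows if r[feature] > thr]
        # Size-weighted variance of the two child nodes.
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(rows)
        gain = total - weighted
        if gain > best[1]:
            best = (thr, gain)
    return best


def rank_features(rows, features):
    """Rank features by the variance reduction of their best single split."""
    scored = {f: best_split(rows, f) for f in features}
    return sorted(scored.items(), key=lambda kv: kv[1][1], reverse=True)


ranking = rank_features(ROWS, ["pretrain_pct", "resource_level"])
for name, (threshold, gain) in ranking:
    print(f"{name}: split at {threshold}, variance reduction {gain:.4f}")
```

A full decision tree repeats this search recursively inside each child node; the feature chosen at the root is the one usually reported as most important.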

Paper Structure (11 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Distribution of resource levels in SIB-200.
  • Figure 2: Decision tree for Bloom-560m (zero-shot, SEEN languages). "General resource level" emerges as the most important feature, with a significant performance difference between languages above and below the 2.5 threshold ($p < 0.001$, Mann-Whitney U test).
  • Figure 3: F1 Score vs. model-specific pretraining data (percentage) for GPT-3.5, mBERT and XLM-R models.
  • Figure 4: Distribution of language families in SIB-200.
  • Figure 5: Distribution of scripts in SIB-200.
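
The Figure 2 caption reports a Mann-Whitney U test comparing languages on either side of a split threshold. As a rough sketch of what such a comparison involves, the code below computes the U statistic from pooled ranks and a two-sided p-value from the normal approximation, without tie or continuity corrections. The function `mann_whitney_u` and the two score lists are hypothetical and the values are invented; a real analysis would more likely rely on a statistics library.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Simplified sketch: tied values get average ranks, but no tie or
    continuity correction is applied to the variance.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, xs in enumerate((a, b)) for v in xs)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1

    n1, n2 = len(a), len(b)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Invented F1 scores for languages above and below a resource-level threshold.
above = [0.70, 0.72, 0.75, 0.80, 0.68, 0.77]
below = [0.30, 0.32, 0.35, 0.33, 0.40, 0.28]
u_stat, p_value = mann_whitney_u(above, below)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

Because the test uses only ranks, it makes no normality assumption about the F1 scores, which suits comparisons between small, skewed groups of languages.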