When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

TL;DR

This study examines how the composition of multilingual pre-training data affects language modeling performance across 252 languages by systematically varying monolingual data size, multilingual data size, model capacity, and linguistic similarity. Using fixed target-language tokenizers and a corpus of 41.4B tokens, the authors show that a moderate amount of multilingual data can improve low-resource language modeling, an effect comparable to increasing the monolingual dataset by up to 33%, particularly when the added languages are syntactically similar to the target. In contrast, high-resource languages consistently perform worse under multilingual pre-training, and limited model capacity worsens negative interference as multilingual data grows. The findings favor targeted multilingual models, which prioritize language similarity and sufficient capacity, over massively multilingual pre-training, with practical implications for data collection and model design in multilingual NLP.

Abstract

Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.

Paper Structure (30 sections, 1 equation, 10 figures, 1 table)

Figures (10)

  • Figure 1: Left: Map of the 252 languages used in our study. Right: Effects of adding multilingual pre-training data in similar languages, for low-resource (1M token) through high-resource (1B token) languages in small models. Effects are quantified using the estimated monolingual dataset size that would achieve similar performance. Adding 1B tokens of multilingual data is similar to adding $22\%$ (low-resource) or removing $63\%$ (high-resource) of the monolingual dataset. Shaded regions are 99% confidence intervals for the mean.
  • Figure 2: Curves predicting monolingual model performance from dataset size. Left: Curves fitted to all languages for each model size. Bold lines are fitted curves, and lighter lines are ground truth curves for individual languages. Right: Sample language-specific curves for small models, extrapolating from only two data points (1M and 10M tokens). This still produces reasonable estimates for 100M and 1B tokens. Bold lines are estimated curves, and dashed lines are ground truth values.
  • Figure 3: Results for low and med-low resource scenarios. Higher $y$-axis values indicate better performance. For example, a small model with 1M monolingual tokens (top right) and 1B added tokens of multilingual data in similar languages has similar performance to 1.2M monolingual tokens alone. Light-colored lines indicate results for individual languages, and bold lines indicate the mean across languages. Shaded regions are 95% confidence intervals for the mean.
  • Figure 4: Left: Correlation between the mean syntactic similarity of the added languages and a model's relative log-likelihood score for the target language (Pearson's $r=0.494$). Added languages are selected to be either similar or dissimilar (§\ref{sec:multilingual-models}). A relative log-likelihood of $1.0$ indicates that the model assigns the eval dataset $2^{1.0}$ times the likelihood assigned by the baseline model for that language. Center: Correlation ($r=0.346$) between the mean lexical (vocabulary) similarity of the added languages and a model's relative log-likelihood score. Right: Variance partitioning into syntactic, geographic, and lexical similarity of the added languages when predicting a model's relative log-likelihood score. Additional results in §\ref{app:similarity-correlations}.
  • Figure 5: Results for med-high and high resource scenarios, using the same format as the low-resource scenarios in Figure 3. For example, adding 1B tokens of multilingual data to a small model with 1B monolingual tokens (high-resource; bottom right) is similar to removing over 600M tokens of the monolingual dataset.
  • ...and 5 more figures
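
The captions above quantify multilingual effects in two ways: a relative log-likelihood in base 2 (Figure 4) and an estimated equivalent monolingual dataset size (Figures 1, 2, 3, and 5). The Python sketch below illustrates only the arithmetic behind those two quantities. All function names are hypothetical, and the log-linear curve is an assumed stand-in chosen for simplicity; the captions do not state the curve form the authors actually fit, so this should not be read as their method.

```python
import math


def relative_log_likelihood(model_ll_bits, baseline_ll_bits):
    """Base-2 log-likelihood of the eval data under the model, minus the baseline's.

    Assumption: both inputs are already log2-likelihoods of the same eval dataset.
    """
    return model_ll_bits - baseline_ll_bits


def likelihood_ratio(relative_ll):
    """A relative log-likelihood of x means 2**x times the baseline likelihood."""
    return 2.0 ** relative_ll


def fit_log_linear(point_a, point_b):
    """Fit score = a + b * log10(tokens) through two (tokens, score) points.

    Assumption: this log-linear form is illustrative only, standing in for the
    unspecified curve that is extrapolated from 1M and 10M tokens in Figure 2.
    """
    (n1, s1), (n2, s2) = point_a, point_b
    b = (s2 - s1) / (math.log10(n2) - math.log10(n1))
    a = s1 - b * math.log10(n1)
    return a, b


def estimated_tokens(score, a, b):
    """Invert the assumed curve: the monolingual token count that yields `score`."""
    return 10.0 ** ((score - a) / b)


def percent_change(estimated, actual):
    """Equivalent dataset size change, as a percentage of the actual dataset."""
    return 100.0 * (estimated - actual) / actual


if __name__ == "__main__":
    # Made-up scores for demonstration; these are not results from the paper.
    a, b = fit_log_linear((1e6, -5.0), (1e7, -4.0))
    equivalent = estimated_tokens(-4.5, a, b)
    print(f"equivalent monolingual tokens: {equivalent:.0f}")
    print(f"change vs. 1M tokens: {percent_change(equivalent, 1e6):.1f}%")
    print(f"likelihood ratio at relative LL 1.0: {likelihood_ratio(1.0)}")
```

Under these assumptions, an estimate of 1.22M tokens against 1M actual tokens corresponds to a 22% increase, which is the form of the figure quoted in the Figure 1 caption. A negative percentage corresponds to the "removing" case described for high-resource languages.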