The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments

Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag

TL;DR

The paper investigates cross-lingual generalisation in multilingual language models and identifies language imbalance as a novel driver that can improve transfer between languages. Through controlled experiments on perfectly equivalent cloned languages, it shows that training with a dominant language boosts weaker languages and increases representation and gradient alignment, with effects amplified by larger models and longer training, and proposes curricula to exploit this without changing data. When extending to real languages (English and French), the benefits of imbalance persist for low-resource languages but are weaker and less clearly tied to representation alignment; anchored vocabularies improve cross-lingual transfer in real-language settings. Overall, the findings suggest that training dynamics can foster circuit sharing across languages under imbalance, though real-world generalisation remains context-dependent and warrants further research, with practical implications for curriculum design and vocabulary strategies in multilingual LMs.

Abstract

Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
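
As a rough illustration of the cloned-language setup with a 90/10 split, here is a minimal sketch. It assumes that a cloned language is an exact copy of the original whose token IDs are shifted by the vocabulary size, and that each training sequence is assigned to one language at random. The abstract does not spell out the construction, so the function names `clone_sequence` and `mix_languages` and the per-sequence sampling are hypothetical choices made for illustration, not the authors' implementation.

```python
import random


def clone_sequence(token_ids, vocab_size):
    """Map a sequence into the (assumed) cloned language by shifting every
    token ID by vocab_size, so the clone shares no tokens with the original."""
    return [t + vocab_size for t in token_ids]


def mix_languages(sequences, vocab_size, clone_fraction, seed=0):
    """Assign each sequence to the cloned language with probability
    clone_fraction; the remaining sequences stay in the original language."""
    rng = random.Random(seed)
    mixed = []
    for seq in sequences:
        if rng.random() < clone_fraction:
            mixed.append(clone_sequence(seq, vocab_size))
        else:
            mixed.append(list(seq))
    return mixed


# Hypothetical toy corpus: 1000 identical sequences over a 10-token vocabulary.
corpus = [[1, 2, 3]] * 1000
imbalanced = mix_languages(corpus, vocab_size=10, clone_fraction=0.1)  # 90/10
balanced = mix_languages(corpus, vocab_size=10, clone_fraction=0.5)    # 50/50
```

Under this assumption the two languages are perfectly equivalent by construction, so any difference in performance between them can only come from how often each one appears during training.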

Paper Structure (33 sections, 1 equation, 14 figures, 4 tables)

Figures (14)

  • Figure 1: LM performance by imbalance ratio. (top) LM perplexity. (bottom) LM accuracy on GLUE; models were fine-tuned in $\mathrm{EN}_1$ and evaluated on either $\mathrm{EN}_1$ or $\mathrm{EN}_2$.
  • Figure 2: $\mathrm{TEff}$ as we train LMs with (left) more data, or (right) larger architectures. mini, small and medium denote GPT sizes in Languini (Stanić et al., 2023), with 11M, 85M, and 303M non-embedding parameters.
  • Figure 3: LM performance on $\mathrm{EN}$ and $\mathrm{FR}$ by imbalance ratio.
  • Figure 4: $\mathrm{TEff}$ of models on $\mathrm{EN}$ and $\mathrm{FR}$ with anchored vocab as we train them with (left) more data, or (right) larger architectures.
  • Figure 5: Fitted power-law curves predicting perplexity as a function of the fraction of training tokens (relative to our standard 1.2B tokens) for different languages and model sizes.
  • ...and 9 more figures