Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer

TL;DR

Beyond performance improvements, X-ELM offers two further benefits: new experts can be added iteratively, adapting X-ELM to new languages without catastrophic forgetting, and training is asynchronous, which reduces the hardware requirements for multilingual training and democratizes multilingual modeling.

Abstract

Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.
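
To make the "multilingual ensemble" idea concrete, the following is a minimal sketch of how next-token distributions from several independently trained experts could be combined at inference time. It is an illustration under stated assumptions, not the paper's implementation: the function name `ensemble_next_token`, the toy vocabularies, and the expert weights are all hypothetical, and the abstract does not specify how the weights are computed.

```python
def ensemble_next_token(expert_probs, expert_weights):
    """Mix per-expert next-token distributions into a single distribution.

    expert_probs:   list of dicts mapping token -> probability (one per expert).
    expert_weights: list of non-negative relevance scores (one per expert).
                    How these are obtained is an assumption left open here.
    """
    total = sum(expert_weights)
    if total <= 0:
        raise ValueError("expert weights must sum to a positive value")

    # Union of all tokens any expert assigns probability to.
    vocab = set().union(*expert_probs)

    # Weighted average of the experts' probabilities for each token.
    return {
        tok: sum((w / total) * probs.get(tok, 0.0)
                 for probs, w in zip(expert_probs, expert_weights))
        for tok in vocab
    }


# Toy example: two hypothetical experts, the first weighted three times higher.
experts = [
    {"the": 0.5, "der": 0.5},
    {"the": 0.1, "der": 0.9},
]
mixed = ensemble_next_token(experts, [3.0, 1.0])
```

Because each expert contributes only through its output distribution, a new expert can be added to the list without retraining the existing ones, which matches the iterative-addition property described in the abstract.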

Paper Structure (47 sections, 1 equation, 7 figures, 11 tables)

This paper contains 47 sections, 1 equation, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Overview of the X-ELM pretraining procedure. Left: We partition the multilingual text corpus into $k$ subsets either through automatic TF-IDF clustering of documents or through grouping languages by linguistic typology. Center: Branch-Train-Merge (BTM) pretraining method. We initialize (branch) $k$ experts from a seed LM, train each expert on a different cluster from the pretraining corpus, and merge the experts into a set of X-ELMs. Right: Hierarchical Multi-Round (HMR) training procedure (§\ref{sec:hmr-training}).
  • Figure 2: Hierarchical clustering of languages used to train our X-ELM ensembles.
  • Figure 3: Average and language-specific (EN and SW) perplexities across different expert counts $k$ (Num. Experts). In each evaluation setting, we compare clustering training data for experts with the TF-IDF$_{top1}$ (square) and Linguistic Typology (triangle) methods (§\ref{sec:method-data}). The best choice of $k$ for each setting is marked with a star.
  • Figure 4: Per-language perplexity (PPL) improvements over XGLM-1.7B (circle) and the dense baseline (triangle), plotted against training data quantity (for Typ. clustered experts).
  • Figure 5: Percentage of language data assigned to different experts with TF-IDF (top row) and Typ. (bottom row) clustering. For Typ. clustering, each language is assigned entirely to a single expert.
  • ...and 2 more figures