Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

Terra Blevins; Tomasz Limisiewicz; Suchin Gururangan; Margaret Li; Hila Gonen; Noah A. Smith; Luke Zettlemoyer

TL;DR

X-ELM independently trains expert language models on subsets of a multilingual corpus and, at the same compute budget, outperforms jointly trained multilingual models. New experts can be added iteratively, adapting X-ELM to new languages without catastrophic forgetting, and training is asynchronous, which reduces the hardware requirements for multilingual training and democratizes multilingual modeling.

Abstract

Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.
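The data flow described in the abstract (independent experts trained on subsets of the corpus, with new experts added later) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "language model" is a smoothed unigram counter rather than a neural LM, and the cluster names, documents, and `VOCAB_SIZE` are hypothetical and not taken from the paper.

```python
import copy
import math
from collections import Counter

VOCAB_SIZE = 16  # hypothetical smoothing constant for the toy model


def train(model, docs):
    """Toy stand-in for LM training: accumulate unigram counts in place."""
    for doc in docs:
        model.update(doc.split())
    return model


def branch_train_merge(seed, clusters):
    """Copy the seed once per cluster, train each copy alone, keep them as a set."""
    experts = {}
    for name, docs in clusters.items():
        expert = copy.deepcopy(seed)         # branch: start from the seed
        experts[name] = train(expert, docs)  # train: no state shared across experts
    return experts                           # merge: the experts form the ensemble


def perplexity(model, text):
    """Add-one smoothed unigram perplexity of `text` under `model`."""
    tokens = text.split()
    total = sum(model.values())
    log_prob = sum(
        math.log((model[tok] + 1) / (total + VOCAB_SIZE)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Hypothetical seed model and data clusters (each language goes to one expert).
seed = train(Counter(), ["the cat", "paka mdogo"])
clusters = {
    "germanic": ["the cat sat", "die katze sass"],
    "bantu": ["paka mdogo analala", "paka analala"],
}
experts = branch_train_merge(seed, clusters)

# Adding an expert later touches neither the seed nor the existing experts.
bantu_before = copy.deepcopy(experts["bantu"])
experts["romance"] = train(copy.deepcopy(seed), ["le chat dort"])
```

Because each expert is a separate copy of the seed, the loop in `branch_train_merge` could run on separate machines at different times, and adding the hypothetical `romance` expert leaves the earlier experts unchanged. In this toy setup the `bantu` expert assigns lower perplexity to Swahili text than the `germanic` expert does.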

Paper Structure (47 sections, 1 equation, 7 figures, 11 tables)

This paper contains 47 sections, 1 equation, 7 figures, 11 tables.

Introduction
Background: Branch-Train-Merge
Cross-lingual Expert Language Models
x-BTM: Sparse Multilingual Training
Step 0: Multilingual Data Allocation
Step 1: Branch
Step 2: Train
Step 3: Merge
Data Allocation Methods
Balanced TF-IDF Clustering
Linguistic Typology Clustering
Inference with X-ELMs
Top-1 Expert
Ensembling TF-IDF Experts
Hierarchical Multi-Round Training
...and 32 more sections

Figures (7)

Figure 1: Overview of the X-ELM pretraining procedure. Left: We partition the multilingual text corpus into $k$ subsets either through automatic TF-IDF clustering of documents or through grouping languages by linguistic typology. Center: Branch-Train-Merge (BTM) pretraining method. We initialize (branch) $k$ experts from a seed LM, train each expert on a different cluster from the pretraining corpus, and merge the experts into a set of X-ELMs. Right: Hierarchical Multi-Round (HMR) training procedure (see the Hierarchical Multi-Round Training section).
Figure 2: Hierarchical clustering of languages used to train our X-ELM ensembles.
Figure 3: Average and language-specific (EN and SW) perplexities across different expert counts $k$ (Num. Experts). In each evaluation setting, we compare clustering training data for experts with the TF-IDF$_{top1}$ (square) and Linguistic Typology (triangle) methods (see the Data Allocation Methods section). The best choice of $k$ for each setting is marked with a star.
Figure 4: Perplexity (PPL) improvements per language over XGLM-1.7B (circle) and the dense baseline (triangle), plotted against training data quantity (for typology-clustered experts).
Figure 5: Percentage of language data assigned to different experts with TF-IDF (top row) and typology (bottom row) clustering. For typology clustering, each language is assigned entirely to a single expert.
...and 2 more figures
