
Exploring the Maze of Multilingual Modeling

Sina Bagheri Nezhad, Ameeta Agrawal

TL;DR

The work investigates why the performance of multilingual language models varies across typologically different languages. It evaluates mBERT, XLM-R, and GPT-3 on text classification (SIB-200) and text generation (mBBC) across dozens of languages, examining pretraining data size, general resource availability, language family, script type, and word order. The analyses rely on decision-tree methods and non-parametric tests, and the newly introduced mBBC dataset is used to test the models on unseen data. The study contributes a cross-model, cross-task perspective that identifies general resource availability, language family, and script type as influential factors, alongside model-specific patterns. The findings have practical implications for designing inclusive multilingual systems and for selecting appropriate datasets and models in diverse linguistic contexts.

Abstract

Multilingual language models have gained significant attention in recent years, enabling the development of applications that serve diverse linguistic contexts. In this paper, we present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3. We assess their performance across a diverse set of languages, with a focus on understanding the impact of resource availability (general and model-specific), language family, script type, and word order on model performance, under two distinct tasks: text classification and text generation. Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, other factors such as general resource availability, language family, and script type are also important. We hope that our study contributes to a deeper understanding of multilingual language models and helps enhance their performance across languages and linguistic contexts.
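
The TL;DR mentions decision-tree analyses of which language features relate to model performance. As a rough illustration only, the sketch below ranks categorical features by how much a single split on each one reduces the squared error of a score, which is the criterion a one-level regression tree would apply at its root. It is not the authors' pipeline: the rows are hypothetical toy values, not results from the paper, and the names `ROWS`, `split_gain`, and `rank_features` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy records (NOT data from the paper): one row per language,
# with two categorical features and a made-up performance score.
ROWS = [
    {"resource_level": "high", "script": "Latin", "score": 0.90},
    {"resource_level": "high", "script": "Cyrillic", "score": 0.86},
    {"resource_level": "mid", "script": "Latin", "score": 0.72},
    {"resource_level": "mid", "script": "Cyrillic", "score": 0.70},
    {"resource_level": "low", "script": "Latin", "score": 0.55},
    {"resource_level": "low", "script": "Cyrillic", "score": 0.49},
]


def sse(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(rows, feature, target="score"):
    """Reduction in squared error from one multiway split on `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    total = sse([row[target] for row in rows])
    within = sum(sse(group) for group in groups.values())
    return total - within


def rank_features(rows, features, target="score"):
    """Return (feature, gain) pairs sorted from largest to smallest gain."""
    gains = {f: split_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, gain in rank_features(ROWS, ["resource_level", "script"]):
        print(f"{name}: {gain:.4f}")
```

A full decision tree repeats this choice recursively within each resulting group. The sketch stops at the first split to keep the ranking idea visible.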

Paper Structure (16 sections, 18 figures, 3 tables)

This paper contains 16 sections, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Distribution of resource level in SIB-200 and mBBC datasets.
  • Figure 2: Distribution of language family in SIB-200 and mBBC datasets.
  • Figure 3: Decision tree visualization. Value refers to the expected F1 score/accuracy of the model.
  • Figure 4: Correlation analysis between performance and pretraining data (train tokens).
  • Figure 5: Model results across different resource levels.
  • ...and 13 more figures