Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

TL;DR

To address data scarcity in tail languages, the paper presents Glot500-c, a corpus covering 511 languages, and Glot500-m, a language model trained on it. The model is scaled horizontally: it widens language coverage rather than improving quality for the roughly 100 languages that are already well served. Augmenting the vocabulary and continuing pretraining yields large gains over an XLM-R baseline across five diverse tasks. The authors' analysis shows that no single factor explains multilingual representation quality; corpus size, script, support from related languages, and total model capacity jointly determine it. Code, data, and models are publicly released to support research on low-resource languages.

Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

Paper Structure (33 sections, 5 equations, 1 figure, 25 tables)

Figures (1)

  • Figure 1: Progression of training for sentence retrieval and sequence labeling. x-axis: epochs/10K. For tail languages, improvement is fast at the beginning, then slows and reaches a plateau. This pattern is only partially observed for head languages.