mALBERT: Is a Compact Multilingual BERT Model Still Worth It?

Christophe Servan; Sahar Ghannay; Sophie Rosset

TL;DR

This article proposes the free release of the first version of a compact multilingual ALBERT model, pre-trained on Wikipedia data, a choice that complies with the ethical considerations surrounding such language models.

Abstract

Within the current trend of Pretrained Language Models (PLMs), more and more criticisms are emerging about the ethical and ecological impact of such models. In this article, considering these critical remarks, we propose to focus on smaller models, such as compact models like ALBERT, which are more ecologically virtuous than these PLMs. However, PLMs enable huge breakthroughs in Natural Language Processing tasks, such as Spoken and Natural Language Understanding, classification, and Question-Answering. PLMs also have the advantage of being multilingual and, as far as we know, a multilingual version of compact ALBERT models does not exist. Considering these facts, we propose the free release of the first version of a multilingual compact ALBERT model, pre-trained using Wikipedia data, which complies with the ethical aspect of such a language model. We also evaluate the model against classical multilingual PLMs on classical NLP tasks. Finally, this paper proposes a rare study of the impact of subword tokenization on language performance.

Paper Structure (13 sections, 1 figure, 4 tables)

This paper contains 13 sections, 1 figure, 4 tables.

Introduction
Model Pre-training
Data
Subword unit
Training parameters
Experiments
Slot-filling benchmark
Classification benchmark
Tokenization Impact
Conclusion
Acknowledgements
Bibliographical References
Language Resource References

Figures (1)

Figure 1: Language distribution (52 languages) over the training corpus. In the legend, languages are ordered by how well represented they are in the corpus, from left to right and from top to bottom. The most represented language is English (en) and the least represented is Amharic (am).
