Multilingual Large Language Models and Curse of Multilinguality
Daniil Gurgurov, Tanja Bäumel, Tatiana Anikina
TL;DR
The paper surveys multilingual large language models, detailing their transformer-based architectures, pre-training objectives, and data sources, with emphasis on the curse of multilinguality as a core limitation. It catalogs representative encoder-only, decoder-only, and encoder-decoder models (e.g., mBERT, XLM-R, mBART, mT5, BLOOM, GPT-3, XGLM, PaLM) and discusses their training data, tokenization, and language coverage. The authors present modular and adapter-based strategies, such as X-MOD, as promising avenues to mitigate cross-language interference and improve low-resource language performance. The work provides a practical overview of model design choices, data considerations, and prospective methods to enhance cross-lingual transfer, aiming to guide researchers and practitioners in building scalable multilingual NLU and NLG systems across diverse languages.
Abstract
Multilingual Large Language Models (LLMs) have gained considerable popularity among Natural Language Processing (NLP) researchers and practitioners. These models, trained on huge datasets, show proficiency across various languages and demonstrate effectiveness in numerous downstream tasks. This paper navigates the landscape of multilingual LLMs, providing an introductory overview of their technical aspects. It explains underlying architectures, objective functions, pre-training data sources, and tokenization methods. This work explores the unique features of different model types: encoder-only (mBERT, XLM-R), decoder-only (XGLM, PaLM, BLOOM, GPT-3), and encoder-decoder models (mT5, mBART). Additionally, it addresses one of the significant limitations of multilingual LLMs, the curse of multilinguality, and discusses current attempts to overcome it.
