A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Kexin Xu, Yuqi Ye, Hanwen Gu
TL;DR
This survey analyzes how multilingual LLMs are shaped by training corpora, alignment strategies, and biases. It documents the evolution from monolingual LLMs to large-scale multilingual models, outlines transformer-based architectures, pre-training, and RLHF, and surveys the corpora and datasets that enable cross-lingual transfer. It reviews static, contextual, and combined multilingual representations, the factors affecting alignment, and biases across languages, together with debiasing approaches and benchmarks. It highlights challenges such as English dominance, the curse of multilinguality, and scarce multilingual bias benchmarks, and it emphasizes future directions including better low-resource language coverage, multilingual evaluation, and ethical considerations.
Abstract
Building on Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, with the aim of transferring knowledge from high-resource languages to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs and discuss these critical issues in depth. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.
