A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias

Yuemei Xu; Ling Hu; Jiayi Zhao; Zihan Qiu; Kexin XU; Yuqi Ye; Hanwen Gu

TL;DR

This survey analyzes how multilingual LLMs are shaped by training corpora, alignment strategies, and biases. It documents the evolution from monolingual LLMs to large-scale multilingual models, outlines transformer-based architectures, pre-training, and RLHF, and surveys the corpora and datasets that enable cross-lingual transfer. It reviews static, contextual, and combined multilingual representations, the factors affecting alignment, and biases across languages, together with debiasing approaches and benchmarks. It highlights challenges such as English dominance, the curse of multilinguality, and scarce multilingual bias benchmarks, and emphasizes future directions including better low-resource language coverage, multilingual evaluation, and ethical considerations.

Abstract

Building on Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, with the aim of transferring knowledge from high-resource languages to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs and discuss these critical issues in depth. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks that are crucial to enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.

Paper Structure (28 sections, 2 equations, 9 figures, 7 tables)

Introduction
Overview of MLLMs
Evolution of MLLMs
Monolingual Evolution
Multilingual Evolution
Key Techniques of MLLMs
Transformer Architecture
Pre-training Technique
Reinforcement Learning with Human Feedback
Multilingual Capacities of MLLMs
Challenges brought by Multilingual Corpora
Cross-lingual Transfer Learning brought by Multilingual Corpora
Multilingual Corpora and Datasets
Multilingual Corpora in MLLMs
Multilingual Datasets for Downstream Tasks
...and 13 more sections

Figures (9)

Figure 1: An illustration of the relationship between corpora, misalignment, and bias. The misalignment and bias produced by MLLM arise in part from the bias and imbalanced language proportions of the training corpora.
Figure 2: An illustration of the evolution roadmap of current multilingual LLMs, presenting their release year, the number of supported languages and release relationship. 'Unknown' indicates the model has not disclosed the language proportion in its training data.
Figure 3: Diagram illustrating the RLHF procedure, which consists of three key steps: (1) Pre-training a LM using the labeled prompt-response dataset, (2) Training a Reward Model based on scores provided by human evaluators for LM's generation, and (3) Fine-tuning with a Reinforcement Learning (RL) algorithm, which helps to update parameters in the LM based on the feedback from RM.
Figure 4: This analysis excludes English and shows the proportions of language families among the top 20 languages in each MLLM's training corpora. Note that Gopher released only its top 10 languages and FuxiTranyu only its top 13. Some recent models, such as GPT-4, have not disclosed the proportions of their training data and are therefore not included in the chart.
Figure 5: Taxonomy of multilingual representation alignment that consists of static, contextual, and combined approaches. In addition, we also summarize the factors that affect alignments.
...and 4 more figures
