Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

Libo Qin; Qiguang Chen; Yuhang Zhou; Zhi Chen; Yinghui Li; Lizi Liao; Min Li; Wanxiang Che; Philip S. Yu

TL;DR

This survey addresses the lack of a unified view for multilingual large language models (MLLMs) by introducing a two-family alignment taxonomy: parameter-tuning alignment (PTA) and parameter-frozen alignment (PFA). It systematically catalogs data resources across pretraining, SFT, and RLHF stages, and provides a detailed synthesis of PTA workflows (pretraining, SFT, RLHF, and downstream finetuning) as well as PFA prompting strategies (direct, code-switching, translation-based, and retrieval-augmented). The paper also highlights emerging frontiers such as hallucination, knowledge editing, safety, fairness, language extension, and multimodality, and offers practical guidance through an organized resources appendix. Overall, the work aims to unify the MLLM literature, clarify methodological options, and accelerate progress by linking data, methods, and evaluation challenges across languages.
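As a rough illustration of the parameter-frozen side of the taxonomy, the four prompting strategies named above can be sketched as prompt builders that never touch model weights. This is a minimal sketch under stated assumptions, not the survey's method: the function names, the template wording, the toy `lexicon`, and the `translate` callable are all hypothetical, and a real system would plug in an actual translator and retriever.

```python
from typing import Callable, Dict, List


def direct_prompt(query: str) -> str:
    """Direct prompting: pass the query to the model in its original language."""
    return f"Answer the following request.\nRequest: {query}\nAnswer:"


def code_switching_prompt(query: str, lexicon: Dict[str, str]) -> str:
    """Code-switching prompting: swap some words for another language's words.

    `lexicon` is a hypothetical bilingual word map; tokens not in it are kept.
    """
    mixed = " ".join(lexicon.get(token, token) for token in query.split())
    return direct_prompt(mixed)


def translation_prompt(query: str, translate: Callable[[str], str]) -> str:
    """Translation-based prompting: translate the query to a pivot language first.

    `translate` is a hypothetical callable standing in for any translation step.
    """
    return direct_prompt(translate(query))


def retrieval_prompt(query: str, passages: List[str]) -> str:
    """Retrieval-augmented prompting: prepend retrieved passages as context."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n{direct_prompt(query)}"
```

Each builder returns a plain string that would be sent to a frozen model; only the input changes across strategies, which is the property that distinguishes PFA from PTA.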

Abstract

Multilingual large language models (MLLMs) use powerful large language models to handle and respond to queries in multiple languages, and have achieved remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, the field still lacks a comprehensive survey that summarizes existing approaches and recent developments. To this end, we present a thorough review and provide a unified perspective on recent progress and emerging trends in the MLLM literature. Our contributions are as follows: (1) First survey: to our knowledge, we take the first step and present a thorough review of the MLLM research field organized by multilingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.

Paper Structure (50 sections, 2 equations, 6 figures, 2 tables)

Introduction
Preliminary
Monolingual Large Language Model
Multilingual Large Language Model
Data Resource
Multilingual Pretraining Data
Multilingual SFT Data
Multilingual RLHF Data
Taxonomy
Parameter-Tuning Alignment
PTA in Pretraining Stage
From-scratch Pretraining Alignment.
Continual Pretraining Alignment.
PTA in SFT Stage
PTA in RLHF Stage
...and 35 more sections

Figures (6)

Figure 1: Parameter-Tuning Alignment vs. Parameter-Frozen Alignment. The former fine-tunes the MLLM parameters for cross-lingual alignment, while the latter uses prompts for alignment without any parameter tuning.
Figure 2: Evolution of selected MLLMs over the past five years, where colored branches indicate different alignment stages. For models with multiple alignment stages, the final stage is represented.
Figure 3: Monolingual Large Language Model vs. Multilingual Large Language Model.
Figure 4: Taxonomy of MLLMs, which includes the Parameter-Tuning Alignment methodology and the Parameter-Frozen Alignment methodology.
Figure 5: Overview of Parameter-Tuning Alignment methods, which include PTA in the Pretraining Stage, PTA in the SFT Stage, PTA in the RLHF Stage, and PTA in the Downstream Finetuning Stage.
...and 1 more figures
