Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, Philip S. Yu

TL;DR

This survey addresses the lack of a unified view for multilingual large language models (MLLMs) by introducing a two-family alignment taxonomy: parameter-tuning alignment (PTA) and parameter-frozen alignment (PFA). It systematically catalogs data resources across pretraining, SFT, and RLHF stages, and provides a detailed synthesis of PTA workflows (pretraining, SFT, RLHF, and downstream finetuning) as well as PFA prompting strategies (direct, code-switching, translation-based, and retrieval-augmented). The paper also highlights emerging frontiers such as hallucination, knowledge editing, safety, fairness, language extension, and multimodality, and offers practical guidance through an organized resources appendix. Overall, the work aims to unify the MLLM literature, clarify methodological options, and accelerate progress by linking data, methods, and evaluation challenges across languages.

Abstract

Multilingual Large Language Models (MLLMs) use powerful Large Language Models to handle and respond to queries in multiple languages, and they have achieved remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, the field still lacks a comprehensive survey that summarizes existing approaches and recent developments. To this end, we present a thorough review and provide a unified perspective on recent progress and emerging trends in the MLLM literature. Our contributions are as follows: (1) First survey: to our knowledge, we take the first step and present a thorough review of the MLLM research field organized by multilingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.

Paper Structure

This paper contains 50 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Parameter-Tuning Alignment vs. Parameter-Frozen Alignment. The former fine-tunes the MLLM parameters for cross-lingual alignment, while the latter uses prompts directly for alignment without parameter tuning.
  • Figure 2: Evolution of selected MLLMs over the past five years, where colored branches indicate different alignment stages. For models with multiple alignment stages, the final stage is represented.
  • Figure 3: Monolingual Large Language Model vs. Multilingual Large Language Model.
  • Figure 4: Taxonomy of MLLMs, which includes the Parameter-Tuning Alignment and Parameter-Frozen Alignment methodologies.
  • Figure 5: Overview of Parameter-Tuning Alignment (PTA) methods, which include PTA in the pretraining stage, PTA in the SFT stage, PTA in the RLHF stage, and PTA in the downstream finetuning stage.
  • ...and 1 more figures