Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, Philip S. Yu
TL;DR
This survey addresses the lack of a unified view for multilingual large language models (MLLMs) by introducing a two-family alignment taxonomy: parameter-tuning alignment (PTA) and parameter-frozen alignment (PFA). It systematically catalogs data resources across the pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) stages, and provides a detailed synthesis of PTA workflows (pretraining, SFT, RLHF, and downstream finetuning) as well as PFA prompting strategies (direct, code-switching, translation-based, and retrieval-augmented). The paper also highlights emerging frontiers such as hallucination, knowledge editing, safety, fairness, language extension, and multimodality, and offers practical guidance through an organized resources appendix. Overall, the work aims to unify the MLLM literature, clarify methodological options, and accelerate progress by linking data, methods, and evaluation challenges across languages.
Abstract
Multilingual large language models (MLLMs) use powerful large language models to handle and respond to queries in multiple languages, and they have achieved remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, the field still lacks a comprehensive survey that summarizes existing approaches and recent developments. To this end, we present a thorough review and provide a unified perspective on recent progress and emerging trends in the MLLM literature. The contributions of this paper can be summarized as follows: (1) First survey: to our knowledge, we take the first step toward a thorough review of the MLLM research field organized by multilingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
