A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

Kaiyu Huang; Fengran Mo; Xinyu Zhang; Hongliang Li; You Li; Yuanchi Zhang; Weijian Yi; Yulong Mao; Jinchen Liu; Yuzhuang Xu; Jinan Xu; Jian-Yun Nie; Yang Liu

TL;DR

The paper surveys Large Language Models (LLMs) in multilingual contexts, proposing a structured taxonomy and a multi-perspective analysis that covers training, inference, information retrieval, security, and domain-specific applications. It synthesizes current approaches, including training from scratch and continual training, direct and pre-translation inference, and retrieval-augmented methods, and it highlights critical limitations and safety concerns, especially for low-resource languages. Key contributions include a multi-perspective framework, an outline of future directions, and a community repository that tracks rapid developments, with the aim of advancing language-fair, globally accessible NLP. The work emphasizes data resources, benchmarks, bias mitigation, and domain-adapted multilingual LLMs as essential for practical, broadly usable multilingual AI systems.

Abstract

Rapidly developing Large Language Models (LLMs) demonstrate remarkable multilingual capabilities in natural language processing, attracting global attention from both academia and industry. Developing language-fair technology is important for mitigating potential discrimination and for improving usability and accessibility for diverse language user groups. Despite the breakthroughs of LLMs, the multilingual scenario remains insufficiently investigated, and a comprehensive survey that summarizes recent approaches, developments, limitations, and potential solutions is desirable. To this end, we provide a survey with multiple perspectives on the utilization of LLMs in the multilingual scenario. We first rethink the transitions between previous and current research on pre-trained language models. We then introduce several perspectives on the multilingualism of LLMs, including training and inference methods, information retrieval, model security, multi-domain applications with language culture, and the usage of datasets. We also discuss the major challenges that arise in these aspects, along with possible solutions. In addition, we highlight future research directions that aim to further enhance LLMs with multilingualism. The survey aims to help the research community address multilingual problems and to provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.

Paper Structure

This paper contains 47 sections, 1 equation, 3 figures, and 12 tables.

Introduction
Preliminary
Multilingual Models
Pre-Trained Language Models
Multilingual Paradigm Transition
Large Language Models with Multilingual Capability
Training from Scratch
Continual Training
Limitations and Future Directions on Training Paradigm
Multilingual Inference Strategies
Direct Inference in Multilingual Models
Pre-Translation Inference
Multilingual CoT
Code-Switching
Multilingual Retrieval Augmented Generation
...and 32 more sections

Figures (3)

Figure 1: An illustration of the training process of LLMs, with a failure case in each phase caused by multilingualism. Because the context of each case is long, we present only the key parts.
Figure 2: A structured taxonomy of LLMs with multilingualism which categorizes current studies.
Figure 3: An overview of representative LLMs and mPLMs in recent years. The illustration consists of one tree that shows the transition between two paradigms ("Pre-train, Fine-tune" $\rightarrow$ "Pre-train, Prompt, Predict"), including three model architectures (encoder-only, decoder-only, and encoder-decoder) and four new frontiers for LLMs with multilingualism.
