Large Multimodal Models for Low-Resource Languages: A Survey

Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

TL;DR

This survey analyzes how large multimodal models are being adapted for low-resource (LR) languages by synthesizing 117 studies across 96 languages and proposing a six-category taxonomy (data creation, synthetic data, fusion, visual enhancement, cross-modal transfer, and architectural innovations). It finds that text-image pairs dominate the field, with uneven language coverage and increasing attention to data resources, fusion strategies, and efficient architectures, while highlighting persistent evaluation and governance challenges. Visual information frequently benefits LR multimodal tasks, yet issues such as hallucination and computational constraints limit broader adoption. The work offers practical guidelines, an open-source repository, and future directions to promote equitable, community-centered progress in LR multimodal NLP.

Abstract

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

Paper Structure

This paper contains 10 sections, 5 figures, 11 tables.

Figures (5)

  • Figure 1: A Venn diagram with the distribution of papers across different modality combinations used by LMMs for low-resource languages. Text+image is the dominant modality pair, while more complex video-inclusive combinations are less common. A selection of representative papers is included for each modality combination. References are clickable links to papers.
  • Figure 2: Distribution of papers across 96 low-resource languages, representing 117 papers. Hindi leads with 31 studies, followed by Arabic (23), Bengali (21), Malayalam (19), Tamil (14), and Korean and Yoruba (10 papers each). The remaining languages have fewer than 10 papers each. Languages with only one paper (42 languages) are listed using ISO 639-1 codes. The data highlights the disparity in research focus among LR languages: a few languages receive most of the attention, while many others remain understudied in the context of multimodal learning. Some papers address multiple languages simultaneously, and each such paper counts toward every language it covers. High-resource (HR) languages such as English, Chinese, Mandarin and Spanish are excluded from this chart.
  • Figure 3: High-level taxonomy of LMMs for low-resource languages. We depict six main categories (inside boxes with green background), which are further divided into subcategories, exemplified via a few representative studies. References are clickable links to papers.
  • Figure 4: Number of LMM papers for LR languages published per year (2018-2025), categorized by technique: Multimodal Data Creation, Synthetic Data Generation, Multimodal Fusion Techniques, Visual Enhancement Techniques, Cross-Modal Transfer Learning, and Architectural Innovations. Best viewed in color.
  • Figure 5: An overview of various fusion strategies employed in LMMs, categorized into early fusion, late fusion, and architectural fusion approaches. Early fusion combines features from different modalities (text, audio, and visual) using feature extractors and fusion techniques, before passing them to a classifier for the final output. Concatenation fusion directly concatenates features from different modalities, while gated fusion employs a gate controller network to regulate information flow between modalities. Late fusion processes each modality using separate models, then combines their predictions using decision-level fusion methods, such as majority voting or weighted averaging. Architectural fusion approaches, such as attention fusion and encoder-decoder fusion, provide more sophisticated methods for multimodal integration. Attention fusion leverages self-attention layers and learned attention weights to selectively focus on relevant features across modalities.
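
As a rough illustration of the fusion strategies named in the Figure 5 caption, the following Python sketch shows concatenation fusion, gated fusion, and two decision-level late fusion rules (weighted averaging and majority voting). It is not taken from the survey or from any surveyed paper: the function names, the scalar sigmoid gate, and the toy feature vectors are assumptions made for illustration, and a real system would use learned encoders and trained gate parameters.

```python
import math
from collections import Counter


def concat_fusion(text_feat, image_feat):
    """Early fusion: join the per-modality feature vectors into one vector."""
    return list(text_feat) + list(image_feat)


def gated_fusion(text_feat, image_feat, gate_weights, gate_bias=0.0):
    """Gated fusion with a scalar gate (an illustrative simplification).

    The gate g = sigmoid(w . [text; image] + b) lies in (0, 1) and controls
    how much of each modality flows into the fused vector.
    """
    if len(text_feat) != len(image_feat):
        raise ValueError("both modalities must share one feature dimension")
    joint = concat_fusion(text_feat, image_feat)
    if len(gate_weights) != len(joint):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-score))
    return [gate * t + (1.0 - gate) * v for t, v in zip(text_feat, image_feat)]


def late_fusion_weighted(prob_lists, weights):
    """Late fusion: weighted average of per-model class probabilities."""
    total = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(num_classes)
    ]


def late_fusion_majority(labels):
    """Late fusion: majority vote over per-model predicted labels."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    text_vec = [2.0, 4.0]    # hypothetical text-encoder output
    image_vec = [0.0, 0.0]   # hypothetical image-encoder output
    print(concat_fusion(text_vec, image_vec))
    # Zero gate weights give g = 0.5, i.e. an equal mix of both modalities.
    print(gated_fusion(text_vec, image_vec, [0.0, 0.0, 0.0, 0.0]))
    print(late_fusion_weighted([[0.8, 0.2], [0.4, 0.6]], [1.0, 1.0]))
    print(late_fusion_majority(["pos", "neg", "pos"]))
```

The early-fusion functions operate on features before any classifier is applied, whereas the late-fusion functions only see each model's output, which is the distinction the caption draws. Attention fusion and encoder-decoder fusion are omitted here because they require trainable layers that do not fit a dependency-free sketch.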