MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi

TL;DR

The paper addresses the performance gap of multimodal large language models (MLLMs) in low-resource languages by proposing a dual-objective framework that separately targets linguistic capability and cultural groundedness. It introduces MELLA, a dual-source, multilingual dataset built from native web alt-text (for cultural knowledge) and MLLM-generated captions translated into the target languages (for linguistic capability), and shows that fine-tuning on MELLA improves both denotative fluency and connotative cultural understanding across eight languages. The work describes the data construction pipeline and training regime, reports experiments against baselines, and analyzes when and why the dual-source approach succeeds or struggles. The results suggest that combining culturally grounded data with linguistically rich descriptions yields more informative "thick descriptions" that better serve low-resource language users. The dataset and methodology offer a practical path toward more culturally aware, linguistically capable MLLMs.

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in low-resource language contexts. Current multilingual enhancement methods are often limited to the text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experimental results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains come from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.

Paper Structure

This paper contains 37 sections, 8 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Image captioning performance on the COCO dataset (Lin et al., 2015) across multiple languages. Compared with GPT-4o (OpenAI, 2024), most of the leading MLLMs achieve their highest BLEU (Papineni et al., 2002) score in English.
  • Figure 2: Standard MLLMs (e.g., InternVL2-8B, Qwen2-VL-7B) trained on generic datasets often fail to generate meaningful output due to limited visual-linguistic alignment. An MLLM with enhanced linguistic capability may produce detailed descriptions. However, only an MLLM enriched with cultural knowledge can accurately recognize the depicted celebrity. All conversations are expected to be in Arabic; "EN" provides translation for clarity.
  • Figure 3: Data Collection Pipeline for MELLA. We first collect images with native alt-text from regional websites to form the cultural knowledge dataset ($D_{know}$). For images without alt-text, we use a powerful MLLM to generate descriptive captions, which are then translated into target low-resource languages to form the linguistic capability dataset ($D_{ling}$). The combination of these two sources creates our final MELLA dataset.
  • Figure 4: Statistical overview of the MELLA dataset. Left: Main statistics including total sample numbers, sizes, and average text lengths across different languages. Middle: Circular diagram of the category distribution visualization. Right: Quantitative distribution showing the eight languages in the dataset with consistent color coding across the diagram. As shown, the MELLA dataset exhibits both broad coverage and balanced representation across topics and languages.
  • Figure 5: Human evaluation over 100 validation samples and 8 volunteers.
  • ...and 4 more figures
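
The dual-source routing described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `WebImage` and `Sample` types, the `build_dual_source` function, and the `caption` and `translate` callables are hypothetical names introduced here, and the stand-in captioner and translator below only mimic the roles of the MLLM and the translation system, whose identities this section does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WebImage:
    url: str
    alt_text: Optional[str]  # native alt-text scraped with the image, if any


@dataclass
class Sample:
    image_url: str
    text: str
    source: str  # "know" for D_know, "ling" for D_ling


def build_dual_source(
    images: list[WebImage],
    caption: Callable[[str], str],         # hypothetical: MLLM captioner
    translate: Callable[[str, str], str],  # hypothetical: (text, lang) -> text
    target_lang: str,
) -> tuple[list[Sample], list[Sample]]:
    """Route each image to D_know (has alt-text) or D_ling (caption + translate)."""
    d_know: list[Sample] = []
    d_ling: list[Sample] = []
    for img in images:
        alt = (img.alt_text or "").strip()
        if alt:
            # Native alt-text is kept as-is to preserve cultural knowledge.
            d_know.append(Sample(img.url, alt, "know"))
        else:
            # No usable alt-text: generate a caption, then translate it.
            generated = caption(img.url)
            d_ling.append(Sample(img.url, translate(generated, target_lang), "ling"))
    return d_know, d_ling


# Stand-ins for the real captioner and translator (assumptions for illustration).
def fake_caption(url: str) -> str:
    return f"A photo from {url}"


def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}] {text}"


images = [
    WebImage("a.jpg", "نص بديل"),  # has native alt-text
    WebImage("b.jpg", None),       # no alt-text
    WebImage("c.jpg", "   "),      # whitespace-only alt-text counts as missing
]
d_know, d_ling = build_dual_source(images, fake_caption, fake_translate, "ar")
mella = d_know + d_ling  # the combined dataset is the union of both sources
```

Treating whitespace-only alt-text as missing is a design choice of this sketch; the caption says nothing about how such cases, or any quality filtering, are handled.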