M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata

TL;DR

M4-RAG introduces the first large-scale benchmark for multilingual, multicultural, and multimodal RAG by combining 42 languages with 56 dialects and over 80k image–question pairs drawn from WorldCuisines and CVQA. It couples this with a controlled multilingual retrieval environment built from Wikipedia snapshots to study when retrieval augments or hinders reasoning in vision–language models across modalities and languages. Across 11 models and four retrieval setups, the study reveals two consistent patterns: retrieval benefits smaller VLMs but can degrade larger models, and current multilingual grounding remains English-centric, with significant gaps for low-resource languages. The work highlights the need for stronger cross-lingual and cross-modal alignment, more robust retrieval, and better integration of cultural context to realize truly universal multimodal VQA systems.

Abstract

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

Paper Structure

This paper contains 32 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The overall framework of M4‑RAG comprises four configurations: (a) a No‑RAG baseline, where the VLM (M) directly takes the question and image as input and predicts an answer; (b) a No‑RAG setup augmented with ground‑truth context, which is concatenated with the question and image to probe the upper bound of how much perfectly relevant knowledge can help; (c) a text‑based RAG configuration, where a multimodal encoder ($E_{\text{mm}}$) encodes the query (image + question), compares it against an indexed document collection, retrieves the top textual context, and feeds this retrieved text to the VLM together with the original inputs; and (d) a multimodal RAG configuration, where documents are stored with embeddings from a text encoder and retrieval can leverage both textual and visual signals, yielding richer multimodal context. In both (c) and (d), the retrieved context is treated as an additional conditioning signal that steers the model toward culturally relevant knowledge while keeping the backbone VLM architecture unchanged.
  • Figure 2: Overall VQA performance on CVQA and WorldCuisines across different model families and sizes, with different retrieval configurations. Each column corresponds to a VLM family (Qwen2.5‑VL, Gemma3, Qwen3), and each panel plots accuracy as a function of model size. Across all families and scales, adding retrieval (solid lines) consistently improves over the No‑RAG baseline (dotted black), with the multimodal RAG variants approaching the Oracle‑Context upper bound. Gains are especially pronounced on the more culturally nuanced WorldCuisines benchmark, where smaller models with RAG can match or exceed much larger non‑RAG models, illustrating that external knowledge is more beneficial than pure parameter scaling in this setting. Among RAG settings, mmE5‑based retrieval generally outperforms B3 and caption‑only retrieval, highlighting the importance of a strong multimodal encoder and joint use of image and query signals to surface culturally relevant evidence.
  • Figure 3: Performance differences in the "no RAG" setting for two models across languages grouped by vitality (high-, medium-, and low-resource). Darker blue indicates larger performance drops when using multilingual prompts relative to English prompts, whereas stronger green indicates performance gains under multilingual prompting.
  • Figure 4: The effect of retrieval quality on RAG performance for various models on the CVQA dataset, using mmE5 for multimodal retrieval. Left: The "Correctness Retention" rate measures the percentage of responses that were correct without RAG and remained correct with RAG. Right: The "Correction Rate" measures the percentage of responses that were incorrect without RAG but were successfully corrected by RAG.
  • Figure 5: The effect of retrieval quality on RAG performance for various models on the CVQA dataset, using B3 for multimodal retrieval. Left: The "Correctness Retention" rate measures the percentage of responses that were correct without RAG and remained correct with RAG. Right: The "Correction Rate" measures the percentage of responses that were incorrect without RAG but were successfully corrected by RAG.
  • ...and 8 more figures
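
The "Correctness Retention" and "Correction Rate" described in the captions of Figures 4 and 5 can be computed from paired per-example correctness flags. The snippet below is a minimal sketch rather than the authors' evaluation code. It assumes that each rate is normalized by the size of its own group (examples correct without RAG for retention, examples incorrect without RAG for correction), which is one reading of the captions. The function name, the input format, and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple


def retention_and_correction(
    correct_without_rag: Sequence[bool],
    correct_with_rag: Sequence[bool],
) -> Tuple[float, float]:
    """Return (correctness_retention, correction_rate) as percentages.

    Assumption: each rate uses its own group as the denominator.
    """
    if len(correct_without_rag) != len(correct_with_rag):
        raise ValueError("both runs must cover the same examples")

    pairs = list(zip(correct_without_rag, correct_with_rag))
    # Outcomes with RAG for examples that were correct without RAG.
    was_correct = [after for before, after in pairs if before]
    # Outcomes with RAG for examples that were incorrect without RAG.
    was_wrong = [after for before, after in pairs if not before]

    # Assumption: report 0.0 when a group is empty to avoid dividing by zero.
    retention = 100.0 * sum(was_correct) / len(was_correct) if was_correct else 0.0
    correction = 100.0 * sum(was_wrong) / len(was_wrong) if was_wrong else 0.0
    return retention, correction


# Hypothetical toy data: five questions graded without and with RAG.
no_rag = [True, True, True, False, False]
with_rag = [True, False, True, True, False]
retention, correction = retention_and_correction(no_rag, with_rag)
print(f"retention={retention:.1f}% correction={correction:.1f}%")
```

In the toy data, two of the three initially correct answers stay correct and one of the two initially incorrect answers is fixed. Reporting the two rates separately distinguishes a model that retrieval helps from one that retrieval distracts, even when overall accuracy barely moves.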