M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha; Patrick Amadeus Irawan; Anshul Singh; En-Shiun Annie Lee; Genta Indra Winata

TL;DR

M4-RAG introduces the first large-scale benchmark for multilingual, multicultural, and multimodal RAG, covering 42 languages and 56 regional dialects and registers with over 80k image–question pairs drawn from WorldCuisines and CVQA. It pairs the benchmark with a controlled multilingual retrieval environment built from Wikipedia snapshots to study when retrieval helps or hinders reasoning in vision–language models across modalities and languages. Across 11 models and four retrieval setups, the study reveals that retrieval benefits smaller VLMs but can degrade larger models, and that current multilingual grounding remains English-centric, with significant gaps for low-resource languages. The work highlights the need for stronger cross-lingual and cross-modal alignment, more robust retrieval, and better integration of cultural context to realize truly universal multimodal VQA systems.

Abstract

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
