Table of Contents
Fetching ...

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan

TL;DR

The paper addresses the lack of culturally and linguistically diverse evaluation for LMMs. It introduces ALM-bench, a culturally rich multilingual VQA benchmark spanning 100 languages, 24 scripts, 73 countries, 19 domains, with 22.7k QA pairs and native-verified annotations. It benchmarks 16 LMMs (open and closed) and finds a substantial performance gap favoring closed-source models (GPT-4o ~78.8% vs best open ~51.9%), with pronounced challenges for low-resource languages and scripts; it also shows that including country-context in prompts can yield gains and that visual input is critical. The results highlight the need for broader cultural coverage, better language resource modeling, and improved evaluation protocols to guide the development of globally inclusive multimodal AI.

Abstract

Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

TL;DR

The paper addresses the lack of culturally and linguistically diverse evaluation for LMMs. It introduces ALM-bench, a culturally rich multilingual VQA benchmark spanning 100 languages, 24 scripts, 73 countries, 19 domains, with 22.7k QA pairs and native-verified annotations. It benchmarks 16 LMMs (open and closed) and finds a substantial performance gap favoring closed-source models (GPT-4o ~78.8% vs best open ~51.9%), with pronounced challenges for low-resource languages and scripts; it also shows that including country-context in prompts can yield gains and that visual input is critical. The results highlight the need for broader cultural coverage, better language resource modeling, and improved evaluation protocols to guide the development of globally inclusive multimodal AI.

Abstract

Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.

Paper Structure

This paper contains 22 sections, 26 figures, 6 tables.

Figures (26)

  • Figure 1: ALM-bench comprises a diverse set of 100 languages with manually verified annotations by respective native language experts. Here, qualitative examples highlight the comprehensive set of 13 cultural aspects covered in the benchmark, such as heritage, customs, architecture, literature, music, and sports. It also evaluates visual understanding for six generic aspects. The ALM-bench focuses on low-resource languages and different regions, spanning 73 countries across five continents and 24 distinct scripts. ALM-bench covers diverse questions, such as multiple choice questions (MCQs), true/false (T/F), short and long visual question answers (VQAs).
  • Figure 2: Benchmarking LMMs on diverse languages & cultures: (a) Performance of various open versus closed-sourced LMMs on ALM-bench. For each LMM, we also report performance on low versus high-resource languages. All these carefully selected models were released after 2023. (b) Performance of high-performing LMMs on our culturally curated 13 categories available in our ALM-bench.
  • Figure 3: Data collection and verification pipeline. Our benchmark features both cultural specific content sourced from the web (left) and generic image understanding collection sourced from existing LMM benchmark. The cultural part is carefully filtered to remove noisy samples and private information. We use GPT4o for translations which are manually verified and corrected with over 800 hours of human annotators (native speakers). Our ALM-bench has diverse question types and features approximately 23K QA pairs in total in 100 languages.
  • Figure 4: Data statistics of our ALM-bench showing the diversity of the scripts, global coverage, comprehensive categories, and various question types. Our dataset contains 22.7K high-quality question-answers in total, covering 100 languages and 24 scripts. All the samples are manually verified by native speakers.
  • Figure 5: Analysis of the type of mistakes GPT-4o made while language translations via human-feedback. GPT-4o encounters more issues with respect to semantic and grammatical errors.
  • ...and 21 more figures