MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen

TL;DR

MMA-ASIA tackles the uneven cultural understanding of LLMs across languages and modalities by introducing a tri-modal, multilingual benchmark spanning 8 Asian countries and 10 languages with 27,000 questions, most requiring multi-step cultural reasoning. It defines a five-dimensional evaluation protocol emphasizing cross-lingual and cross-modal consistency, grounding validation, and generalization under held-out themes, and it provides LLM-as-Judge and Vision-ablated Prefix Replay tools to analyze failures. Experimental results show persistent language-resource gaps, weaker cross-modal transfer, and grounding shortcuts, with accents sometimes acting as useful cultural priors. The work releases extensive data, evaluation scripts, and baselines to guide the development of culturally reliable, multilingual multimodal LLMs and highlights the need for grounding-aware evaluation and stronger cross-modal alignment.

Abstract

Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.

Paper Structure

This paper contains 49 sections, 6 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: An overview of the MMA-ASIA evaluation framework: data creation pipeline, representative dataset samples, and evaluation dimensions.
  • Figure 2: Performance of LLMs on MMA-ASIA across modalities. Exact values are provided in the appendix section on the performance of LLMs. For each country and modality, the dataset contains 500 questions presented in multiple languages. The vertical axis reports Accuracy (%), defined as the number of items where the model's chosen option exactly matches the correct option, divided by 500. The x-axis label {Country}-{Language} denotes the cultural dataset for {Country}, presented in {Language}.
  • Figure 3: Rationale Unfaithfulness Rates (RUR) of LLMs on text-only QA and VQA. Similar trends are observed for Rephrase VQA and Spoken QA; detailed results are provided in the appendix section on RUR.
  • Figure 4: (a) Cross-lingual consistency with fixed country and modality and (b) cross-modal consistency with fixed language and country. TX/VL/RE/SP represent text QA, visual QA, rephrase QA, and speech QA.
  • Figure 5: Attention heatmap visualization over image regions during incorrect model answers. Color scale from blue (low) to red (high) indicates increasing model attention.
  • ...and 9 more figures
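
The accuracy metric described in the Figure 2 caption (exact-match count divided by the number of questions, reported as a percentage) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `accuracy_percent` and the representation of answers as option-letter strings are assumptions made here.

```python
from typing import Sequence


def accuracy_percent(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy as a percentage.

    `predicted` and `gold` hold one option label per question (for
    example "A", "B", "C", "D"); this layout is a hypothetical choice,
    not taken from the paper.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    # An item counts as correct only if the chosen option matches exactly.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    # Toy example with 4 questions; a country-modality split in the
    # benchmark would have 500.
    print(accuracy_percent(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```

With a 500-question split per country and modality, the denominator is 500, which matches the definition given in the caption.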