M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen

TL;DR

M4U is a large-scale, expert-level multilingual multimodal benchmark (10,005 questions across 64 disciplines and six languages) with interleaved image-text content, designed to test perception, knowledge, and reasoning. The study evaluates 22 LMMs and 4 LLMs under zero-shot and tool-assisted conditions and finds considerable gaps in multilingual multimodal reasoning, including language biases and cross-lingual degradation. Key findings: GPT-4o achieves only 47.6% average accuracy, end-to-end cross-lingual prompting often outperforms translate-then-questioning, and the benefit of chain-of-thought prompting varies substantially across languages and models. The work releases the dataset publicly, describes its quality controls, and provides analysis intended to support future work on multilingual multimodal understanding and reasoning.

Abstract

Multilingual capability is an essential aspect of large multimodal models, since they are usually deployed across various countries and languages. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing multi-discipline multilingual multimodal understanding and reasoning. M4U contains 10k samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in six languages. Using M4U, we conduct extensive evaluations of leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results demonstrate that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, struggle to reason over multilingual information present in both the visual and the textual context. Specifically, they suffer performance degradation when prompted with cross-lingual multimodal questions. Our code and dataset are publicly available.
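As a rough illustration of how a per-language breakdown and an "average accuracy" figure of the kind quoted above could be computed, here is a minimal sketch. It assumes the average is an unweighted mean over languages, which the text here does not state. The function names, the record layout, and the toy predictions are all hypothetical and are not taken from the M4U code or data.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Compute accuracy per language.

    `records` is an iterable of (language, predicted_option, gold_option)
    tuples. This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        correct[language] += int(predicted == gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def average_accuracy(per_language):
    """Unweighted mean over languages (assumed, not confirmed by the source)."""
    return sum(per_language.values()) / len(per_language)


def language_gap(per_language):
    """Best-minus-worst language accuracy, a crude proxy for language preference."""
    return max(per_language.values()) - min(per_language.values())


# Hypothetical toy predictions, not M4U data.
toy_records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "C", "D"), ("en", "D", "D"),
    ("zh", "A", "A"), ("zh", "B", "C"), ("zh", "C", "C"), ("zh", "D", "A"),
    ("de", "A", "B"), ("de", "B", "B"), ("de", "C", "A"), ("de", "D", "C"),
]

scores = per_language_accuracy(toy_records)
print(scores)
print(average_accuracy(scores), language_gap(scores))
```

A large gap between the best and worst language on the same questions is one simple way to quantify the "language preferences" mentioned in the abstract; the benchmark's own analysis may use a different measure.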

Paper Structure

This paper contains 23 sections, 32 figures, 7 tables.

Figures (32)

  • Figure 1: An illustration of multi-discipline multilingual multimodal understanding. Both the textual questions and the images contain multilingual content. The Chinese content is highlighted in yellow. English translations are provided for readability.
  • Figure 2: Key statistics of the M4U dataset. M4U covers a wide range of tasks from Science, Engineering and Health in Chinese, English and German, and supports interleaved vision-language documents.
  • Figure 3: An example from the Chemistry-Inorganic subject of the M4U dataset. The sample contains multiple images and has multilingual content in both the question and the images.
  • Figure 4: Zero-shot accuracy of different LMMs by image type (left) and image position (right) on the M4U dataset.
  • Figure 5: Zero-shot accuracy of GPT-4o across 64 subjects on the M4U dataset.
  • ...and 27 more figures