JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Kazuki Egashira, Jeonghun Baek, Xiang Yue, Graham Neubig, Kiyoharu Aizawa

TL;DR

JMMMU addresses the gap in evaluating LMMs beyond English by introducing a Japanese MMMU-style benchmark with two complementary subsets: CA, a translation-based culture-agnostic set, and CS, newly crafted culture-specific content reflecting Japanese context. The dataset comprises 1,320 questions and 1,118 images across 28 subjects, enabling diagnosis of both language proficiency and culture-aware reasoning. Experimental results show a sizable gap between open-source and proprietary models, with CS performance revealing pronounced cultural understanding gaps even when CA performance is strong, and translation effects highlighting potential evaluation biases. The work emphasizes the need for culture-centric multilingual benchmarks across languages to guide inclusive LMM development and provides a framework for expanding such benchmarks to other cultures.

Abstract

Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, where culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation. Using the CS subset, we reveal their inadequate Japanese cultural understanding. Further, by combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline to create high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.
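
The comparison between subsets described in the abstract can be illustrated with a minimal sketch. It assumes per-question results are available as (model, subset, is_correct) tuples. The record format, the model names, and the helper names `subset_accuracy` and `ca_cs_gap` are hypothetical and are not taken from the JMMMU code or data release. A large positive gap would correspond to a model that does well on CA but not on CS.

```python
from collections import defaultdict


def subset_accuracy(records):
    """Return {model: {subset: accuracy}} from (model, subset, is_correct) records."""
    totals = defaultdict(lambda: defaultdict(int))
    hits = defaultdict(lambda: defaultdict(int))
    for model, subset, is_correct in records:
        totals[model][subset] += 1
        hits[model][subset] += int(is_correct)
    return {
        model: {s: hits[model][s] / totals[model][s] for s in totals[model]}
        for model in totals
    }


def ca_cs_gap(acc):
    """Return {model: CA accuracy minus CS accuracy} for models scored on both subsets."""
    return {m: a["CA"] - a["CS"] for m, a in acc.items() if "CA" in a and "CS" in a}


# Toy records for two made-up models (illustrative only, not benchmark results).
records = [
    ("model_a", "CA", True), ("model_a", "CA", True),
    ("model_a", "CS", True), ("model_a", "CS", False),
    ("model_b", "CA", True), ("model_b", "CA", False),
    ("model_b", "CS", True), ("model_b", "CS", True),
]

acc = subset_accuracy(records)
gaps = ca_cs_gap(acc)
for model in sorted(gaps):
    print(model, acc[model], gaps[model])
```

In this toy data, `model_a` has a positive gap (stronger on CA than on CS) and `model_b` a negative one. Real use would replace the toy records with actual evaluation outputs.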

Paper Structure

This paper contains 53 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of the JMMMU dataset. JMMMU includes 720 culture-agnostic (translation-based) questions and 600 culture-specific (newly created) questions, totaling 1,320 questions, thus expanding the existing culture-aware Japanese benchmark (Inoue et al., 2024) by over 10 times. JMMMU serves as a diagnostic tool for assessing both Japanese cultural understanding and culture-agnostic language understanding capability.
  • Figure 2: Example of the image translation process. English words in the image are manually overwritten with Japanese.
  • Figure 3: Score correlation between subsets. While proprietary models ($\blacksquare$) perform the best on both subsets, Japanese LMMs ($\bigstar$) and Pangea ($\blacklozenge$, a culture-aware multilingual LMM) score remarkably high on the CS subset compared with models that perform similarly on the CA subset.
  • Figure 4: (a) A considerable number of questions are answered correctly by GPT-4o in only one of the two languages (yellow + orange). (b) In Japanese, the model more often disregards the instruction to answer directly and instead generates its reasoning process, which leads to a correct answer.
  • Figure 5: Error distribution over culture-specific subjects. Lack of Knowledge is the majority error type at over 50%.
  • ...and 8 more figures
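
The per-language breakdown described in the Figure 4(a) caption can be sketched as a simple partition of questions by correctness in each language. This is a minimal illustration that assumes correctness is stored as one boolean per question ID per language. The function name `language_agreement`, the question IDs, and the values are hypothetical and do not come from the paper.

```python
from collections import Counter


def language_agreement(correct_en, correct_ja):
    """Count questions by whether they are answered correctly in English, Japanese, both, or neither.

    Both arguments map question ID -> bool; only IDs present in both are counted.
    """
    counts = Counter()
    for qid in correct_en.keys() & correct_ja.keys():
        en, ja = correct_en[qid], correct_ja[qid]
        if en and ja:
            counts["both"] += 1
        elif en:
            counts["english_only"] += 1
        elif ja:
            counts["japanese_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy correctness flags for five made-up question IDs (illustrative only).
correct_en = {"q1": True, "q2": True, "q3": False, "q4": False, "q5": True}
correct_ja = {"q1": True, "q2": False, "q3": True, "q4": False, "q5": True}

agreement = language_agreement(correct_en, correct_ja)
print(dict(agreement))
```

The `english_only` and `japanese_only` counts together play the role of the questions answered correctly in only one language. Such a comparison only makes sense for questions that exist in both languages, which in JMMMU means the translated CA subset.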