GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Yi Zong; Xipeng Qiu

TL;DR

GAOKAO-MM introduces a native-Chinese, human-level multimodal benchmark derived from the GAOKAO exam, comprising $646$ questions across $8$ subjects and $897$ images across $12$ types to probe perception, understanding, knowledge, and reasoning. It adopts a zero-shot evaluation framework over $10$ LVLMs with subject-specific prompts and explicit reasoning outputs, revealing that all models score below $50\%$, with GPT-4V leading at $48.1\%$ and open-source models lagging by more than $11\%$ behind closed-source counterparts. The analysis exposes weaknesses in mathematical reasoning, long-text comprehension, and robustness to year-to-year question-image variations, while highlighting that native Chinese context and detailed explanations can both challenge and guide LVLM development. GAOKAO-MM thus serves as a benchmark to spur progress toward true multimodal understanding in Chinese contexts and informs broader multilingual LVLM advancement, education-focused applications, and future extensions with expanded data and deeper inference analysis.
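The zero-shot setup summarized above (a subject-specific prompt, a model response that includes its reasoning, and an accuracy score) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt wording, the `Answer:` marker convention, the helper names, and the toy records are hypothetical and are not the paper's actual prompts, parser, or data.

```python
import re
from collections import defaultdict

# Hypothetical subject-specific instructions (NOT the paper's actual prompts).
SUBJECT_PROMPTS = {
    "math": "You are taking a GAOKAO math exam. Explain your reasoning, "
            "then finish with 'Answer: <option>'.",
    "history": "You are taking a GAOKAO history exam. Explain your reasoning, "
               "then finish with 'Answer: <option>'.",
}


def build_prompt(subject, question):
    """Prepend the subject-specific instruction to the question text (zero-shot)."""
    return f"{SUBJECT_PROMPTS[subject]}\n\n{question}"


def extract_choice(model_output):
    """Return the option letters after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*([A-D]+)", model_output)
    return matches[-1] if matches else None


def accuracy_by_subject(records):
    """Compute per-subject and overall accuracy.

    records: iterable of (subject, gold_answer, model_output) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, output in records:
        total[subject] += 1
        if extract_choice(output) == gold:
            correct[subject] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_subject, overall


# Toy, made-up model outputs used only to exercise the functions.
toy_records = [
    ("math", "B", "The slope is positive, so B fits. Answer: B"),
    ("math", "C", "Answer: A"),
    ("history", "D", "The map shows the later period. Answer: D"),
]
per_subject, overall = accuracy_by_subject(toy_records)
```

Parsing the final marked answer, rather than matching the whole response, lets the model print its reasoning without affecting the score; a real evaluation would also need a rule for responses that contain no parsable answer, which this sketch simply counts as incorrect.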

Abstract

Large Vision-Language Models (LVLMs) have demonstrated strong abilities in image perception and language understanding. However, existing multimodal benchmarks focus on basic perception abilities and commonsense knowledge, which is insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps, and photos. GAOKAO-MM derives from a native Chinese context and sets human-level requirements for a model's abilities, including perception, understanding, knowledge, and reasoning. We evaluate 10 LVLMs and find that all of them score below 50% accuracy, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%), and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimensional analysis indicate that LVLMs remain at a moderate distance from Artificial General Intelligence (AGI) and provide insights to facilitate the development of multilingual LVLMs.

Paper Structure (18 sections, 6 figures, 3 tables)

This paper contains 18 sections, 6 figures, 3 tables.

Introduction
GAOKAO-MM
Dataset Description
Data Collection
Comparisons with Existing Benchmarks
Experiments
Methodology
Results
Analysis
Difference in Subjects
Difference in Image Types
Difference in Annual Trends
Conclusion
Key Statistics of GAOKAO-MM
Examples
...and 3 more sections

Figures (6)

Figure 1: An example of a math question in GAOKAO-MM. The English translation of the text and images was added by the author for readers' understanding.
Figure 2: Different Performance in Subjects.
Figure 3: Different Performance in Image Types.
Figure 4: Difference in Annual Trends. The light-colored lines represent the accuracy obtained from three tests, while the dark-colored line represents the average accuracy.
Figure 5: Distribution of Image Types in GAOKAO-MM
...and 1 more figures
