GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Yi Zong, Xipeng Qiu

TL;DR

GAOKAO-MM introduces a native-Chinese, human-level multimodal benchmark derived from the GAOKAO exam, comprising $646$ questions across $8$ subjects and $897$ images across $12$ types to probe perception, understanding, knowledge, and reasoning. It adopts a zero-shot evaluation framework over $10$ LVLMs with subject-specific prompts and explicit reasoning outputs, revealing that all models score below $50\%$, with GPT-4V leading at $48.1\%$ and open-source models lagging by more than $11\%$ behind closed-source counterparts. The analysis exposes weaknesses in mathematical reasoning, long-text comprehension, and robustness to year-to-year question-image variations, while highlighting that native Chinese context and detailed explanations can both challenge and guide LVLM development. GAOKAO-MM thus serves as a benchmark to spur progress toward true multimodal understanding in Chinese contexts and informs broader multilingual LVLM advancement, education-focused applications, and future extensions with expanded data and deeper inference analysis.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. However, existing multimodal benchmarks focus on primary perception abilities and commonsense knowledge, which are insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from native Chinese context and sets human-level requirements for the model's abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that all of them achieve accuracies below 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs remain a moderate distance from Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs.

Paper Structure (18 sections, 6 figures, 3 tables)

This paper contains 18 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An example of a math question in GAOKAO-MM. The English translations of the text and images were added by the authors for readers' understanding.
  • Figure 2: Different Performance in Subjects.
  • Figure 3: Different Performance in Image Types.
  • Figure 4: Difference in Annual Trends. The light-colored lines represent the accuracy obtained from three tests, while the dark-colored line represents the average accuracy.
  • Figure 5: Distribution of Image Types in GAOKAO-MM
  • ...and 1 more figure
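
The caption of Figure 4 describes an average accuracy computed from three tests per year. A minimal sketch of that averaging step follows, assuming each test run is stored as a mapping from exam year to accuracy. The function name `average_accuracy_by_year`, the data layout, and the numbers are hypothetical placeholders for illustration, not the authors' code or results.

```python
from statistics import mean


def average_accuracy_by_year(runs):
    """Average per-year accuracy over several test runs.

    runs: list of dicts, one per run, each mapping year -> accuracy in [0, 1].
    A year missing from a run is simply left out of that year's average.
    """
    years = sorted(set().union(*(run.keys() for run in runs)))
    return {
        year: mean(run[year] for run in runs if year in run)
        for year in years
    }


# Hypothetical placeholder accuracies, NOT results reported in the paper.
runs = [
    {2021: 0.40, 2022: 0.50},
    {2021: 0.50, 2022: 0.30},
    {2021: 0.60, 2022: 0.40},
]
averaged = average_accuracy_by_year(runs)
```

Under these assumptions, each dict in `runs` would correspond to one light-colored line in the figure, and `averaged` to the dark-colored line.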