MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang

TL;DR

MDK12-Bench introduces a large-scale, multi-disciplinary benchmark for evaluating high-order multimodal reasoning in Multimodal LLMs, spanning six disciplines with 140K reasoning instances and structured knowledge points across cross-year partitions. It couples a rigorous data curation pipeline with a six-level knowledge tree and a dynamic evaluation framework that bootstraps both text and image inputs to mitigate data contamination. Experimental results across multiple closed- and open-source systems show that larger, reasoning-focused models perform better but still struggle with cross-domain reasoning and robustness under perturbations, highlighting persistent gaps in multimodal reasoning capabilities. The open data and dynamic testing framework provide a robust platform for ongoing evaluation and targeted improvements toward more resilient and generalizable multimodal reasoning.

Abstract

Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal significant limitations of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of next-generation models. Our data and code are available at https://github.com/LanceZPF/MDK12.
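To make the annotation structure concrete, the following is a minimal sketch of how one benchmark instance and its six-level knowledge path might be represented. The level names follow the figure captions of the paper; the class name, field names, and toy records are hypothetical and are not taken from the released data or code.

```python
from collections import Counter
from dataclasses import dataclass

# Level names as listed in the figure captions; the identifiers are our own.
LEVELS = (
    "discipline",
    "grade",
    "curriculum",
    "topic",
    "meta_knowledge",
    "key_knowledge_point",
)


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one reasoning instance (not the official schema)."""
    question: str
    answer: str
    difficulty: str        # e.g. "easy", "medium", "hard"
    exam_year: int         # used for cross-year partitions
    knowledge_path: tuple  # one label per entry of LEVELS, coarse to fine

    def label_at(self, level: str) -> str:
        # Look up the label this instance carries at the requested level.
        return self.knowledge_path[LEVELS.index(level)]


def coverage(instances, level):
    """Count instance occurrences per label at one level of the knowledge tree."""
    return Counter(inst.label_at(level) for inst in instances)


# Made-up toy records, for illustration only.
toy = [
    Instance("q1", "a1", "easy", 2023,
             ("math", "high school", "algebra", "functions",
              "monotonicity", "increasing functions")),
    Instance("q2", "a2", "hard", 2024,
             ("math", "middle school", "geometry", "triangles",
              "congruence", "SAS criterion")),
    Instance("q3", "a3", "medium", 2024,
             ("physics", "high school", "mechanics", "kinematics",
              "uniform acceleration", "velocity-time graphs")),
]

discipline_counts = coverage(toy, "discipline")
```

Counting occurrences at a chosen level, as `coverage` does here, is one plausible way to produce per-discipline or per-curriculum coverage statistics from such records.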

Paper Structure

This paper contains 14 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of MDK12-Bench. It comprises 140K instances and spans 6 disciplines in K-12 education. The knowledge system of our bench is structured into six fine-grained levels: discipline, grade, curriculum, topic, meta-knowledge, and key knowledge point, where the three rings showcase the first three levels. Examples illustrate the representative grades (elementary, middle, and high school), difficulty levels (easy, medium, and hard), questions and answers, and key knowledge points of each discipline. The diverse question forms (single- and multiple-choice, open-ended, fill-in-the-blank, true-or-false) and detailed answer explanations are also demonstrated.
  • Figure 2: Data curation and statistics of our MDK12-Bench. (a) The data curation pipeline consists of four stages: data collection, screening, parsing, and processing. The knowledge in our benchmark is structured into six interconnected levels: Level 1 - discipline, Level 2 - grade, Level 3 - curriculum, Level 4 - topic, Level 5 - meta-knowledge, and Level 6 - key knowledge point. Knowledge coverage statistics are shown as the number of instance occurrences at (b) the discipline and grade levels; (c) the high-school curriculum level; and (d) the elementary- and middle-school curriculum level. Examples of curriculum-level knowledge points are also demonstrated.
  • Figure 3: The proposed dynamic MLLM evaluation pipeline. It includes image and text bootstrapping modules to mitigate data contamination and a two-stage answer evaluation module that compares the model answers with the ground truth.
  • Figure 4: Knowledge points (Level 5 - Meta Knowledge) ranked by mean accuracy of Gemini2-thinking on the MDK12-Mini dataset.
  • Figure 5: Breakdown of accuracy on MDK12-Mini across different exam years.
  • ...and 3 more figures
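
The per-knowledge-point ranking (Figure 4) and the per-year breakdown (Figure 5) both reduce to grouping graded answers by one attribute and averaging correctness. A minimal sketch of that aggregation is given below; the record keys, function names, and toy results are hypothetical and do not reflect the authors' evaluation code or reported numbers.

```python
from collections import defaultdict


def mean_accuracy_by(results, key):
    """Group graded results by `key` and return the mean accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for record in results:
        bucket = totals[record[key]]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}


def ranked(results, key):
    """Return (group, mean accuracy) pairs sorted from highest to lowest accuracy."""
    scores = mean_accuracy_by(results, key)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up graded answers, for illustration only.
toy_results = [
    {"meta_knowledge": "functions", "exam_year": 2023, "correct": True},
    {"meta_knowledge": "functions", "exam_year": 2024, "correct": False},
    {"meta_knowledge": "circuits", "exam_year": 2023, "correct": True},
    {"meta_knowledge": "circuits", "exam_year": 2024, "correct": True},
    {"meta_knowledge": "maps", "exam_year": 2024, "correct": False},
]

by_knowledge_point = ranked(toy_results, "meta_knowledge")
by_year = mean_accuracy_by(toy_results, "exam_year")
```

Grouping by `exam_year` instead of `meta_knowledge` gives the cross-year view, which is where a drop in accuracy on newer exams would show up if earlier years had leaked into training data.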