ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, Qingsong Wen

TL;DR

The paper introduces ErrorRadar, a multimodal benchmark for error detection in K-12 mathematics, filling a gap left by benchmarks that focus on problem solving. It formalizes two tasks, error step identification and error categorization, and builds a 2,500-instance dataset from real student interactions with rich multimodal inputs. An extensive evaluation of open-source and closed-source MLLMs against human expert evaluators shows that current models fall short of human performance, with clear differences between visual-perception errors and higher-order reasoning errors, and notable scaling effects. The dataset and findings highlight critical challenges in multimodal mathematical reasoning and establish a platform for advancing error-detection capabilities in educational AI systems.

Abstract

As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, which matters for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in this task. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated representative open-source and closed-source MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that significant challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation. The dataset will be available upon acceptance.
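To make the two sub-tasks concrete, the following is a minimal sketch of how predictions might be scored on each of them. It assumes exact-match accuracy as the metric, which the abstract does not state, and the record fields (`gold_step`, `pred_category`, and so on) and the category labels are hypothetical placeholders rather than the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One hypothetical record; field names are illustrative, not ErrorRadar's schema."""
    gold_step: int       # annotated index of the first incorrect step
    gold_category: str   # annotated error category
    pred_step: int       # step index predicted by the model
    pred_category: str   # error category predicted by the model


def subtask_accuracy(instances):
    """Return (step accuracy, category accuracy) as exact-match rates (assumed metric)."""
    if not instances:
        raise ValueError("need at least one instance")
    n = len(instances)
    step_hits = sum(i.pred_step == i.gold_step for i in instances)
    cat_hits = sum(i.pred_category == i.gold_category for i in instances)
    return step_hits / n, cat_hits / n


# Toy data with placeholder labels; these are not results from the paper.
demo = [
    Instance(2, "reasoning", 2, "reasoning"),
    Instance(3, "visual perception", 3, "reasoning"),
    Instance(1, "reasoning", 4, "reasoning"),
    Instance(5, "visual perception", 5, "visual perception"),
]
step_acc, cat_acc = subtask_accuracy(demo)
print(f"step accuracy: {step_acc:.2f}, category accuracy: {cat_acc:.2f}")
```

Scoring the two sub-tasks separately reflects the point that a model can locate the wrong step while mislabeling the kind of error, or the reverse.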

Paper Structure

This paper contains 25 sections, 9 equations, 19 figures, and 4 tables.

Figures (19)

  • Figure 1: Comparison of research scope between previous work and our proposed ErrorRadar benchmark on mathematical reasoning tasks.
  • Figure 2: Example of our well-annotated multimodal mathematical reasoning dataset ErrorRadar, and performance comparison on error categorization and error step localization tasks among representative MLLMs. It is evident that even simple math problems can be mishandled by the currently superior MLLMs in one or both tasks, highlighting the challenging nature of our proposed multimodal error detection setting.
  • Figure 2: Key statistics of ErrorRadar.
  • Figure 3: Roadmap of ErrorRadar dataset collection, annotation, and consistent update.
  • Figure 4: Dataset distribution of ErrorRadar with respect to problem type and error category.
  • ...and 14 more figures