MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

Kai Sun, Yushi Bai, Ji Qi, Lei Hou, Juanzi Li

TL;DR

MM-MATH introduces a fine-grained, process-aware benchmark for multimodal math reasoning in large models, combining outcome accuracy with automatic process analysis to identify first-step errors. It assembles 5,929 open-ended middle-school problems with visual contexts and metadata on difficulty, grade level, and knowledge points, enabling evaluation across multiple dimensions. The study reveals that current LMMs struggle with diagram interpretation and rely largely on textual cues, with a substantial gap to human performance. The dataset and its process-evaluation framework provide a targeted avenue for improving visual-math understanding and reasoning in multimodal systems.

Abstract

To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.
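The two evaluation views described in the abstract, outcome accuracy and the share of each error type found by process evaluation, can be illustrated with a small aggregation sketch. This is not the authors' implementation: the record layout (`correct`, `first_error`), the function name `summarize`, and the label strings are assumptions made for illustration, and the sample records are made-up toy data, not results from the paper.

```python
from collections import Counter


def summarize(records):
    """Aggregate judged solutions into outcome accuracy and error-type shares.

    Each record is a dict with two hypothetical fields:
      - "correct": bool, whether the final answer matched the reference
      - "first_error": str or None, the error type a judge assigned
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")

    # Outcome evaluation: fraction of problems answered correctly.
    accuracy = sum(1 for r in records if r["correct"]) / total

    # Process evaluation: distribution of error types among failed solutions.
    errors = Counter(
        r["first_error"]
        for r in records
        if not r["correct"] and r["first_error"] is not None
    )
    n_err = sum(errors.values())
    shares = {label: count / n_err for label, count in errors.items()} if n_err else {}
    return accuracy, shares


# Toy data only; the labels are illustrative placeholders.
records = [
    {"correct": True, "first_error": None},
    {"correct": False, "first_error": "diagram_misinterpretation"},
    {"correct": False, "first_error": "diagram_misinterpretation"},
    {"correct": False, "first_error": "reasoning"},
]
accuracy, shares = summarize(records)
print(accuracy, shares)
```

Keeping the two summaries separate mirrors the abstract's distinction: accuracy says how often a model fails, while the error shares say why.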

Paper Structure (29 sections, 1 equation, 15 figures, 4 tables)

Figures (15)

  • Figure 1: An overview of the MM-MATH benchmark design. The problems are classified by difficulty, grade level, and knowledge point. We include both outcome evaluation and process evaluation to identify and attribute errors in the model's reasoning process.
  • Figure 2: Knowledge point distribution of MM-MATH. Properties of Shapes refers to the characteristics of different shapes, Shape transformation investigates the deformation and movements of shapes, and Function refers to the mutual reasoning between algebraic expressions and graphs.
  • Figure 3: Examples of four different types of errors in multimodal math reasoning.
  • Figure 4: Proportion of four types of errors in various LMMs, with diagram misinterpretation errors and reasoning errors constituting the majority.
  • Figure 5: Number of the first two errors in evaluated LMMs.
  • ...and 10 more figures