MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

TL;DR

MathOPEval introduces a fine-grained benchmark to evaluate MLLMs on visual operations expressed through code during mathematical reasoning. By separating Multi-modal Code Generation (MCG) and Multi-modal Code Editing (MCE) across five visualization types, the dataset enables precise assessment of intermediate visual operations, not just final answers. Experiments with nine MLLMs reveal substantial gaps to human performance, especially in MCE and function plots, and show that prompt strategy and model scale yield inconsistent gains. The work demonstrates the potential and limits of current MLLMs in visual reasoning and provides a concrete benchmark and evaluation framework to drive future improvements in visual perception and multi-modal programming capabilities.

Abstract

Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving MLLMs' ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification, and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
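To make the four visual operations named above concrete, the following is a minimal illustrative sketch, not the benchmark's actual code or instruction format, which this page does not show. It assumes a hypothetical figure representation (a list of dictionaries, one per drawable element) and hypothetical helper names (`generate_triangle`, `delete`, `modify`, `annotate`). The first function stands in for an MCG-style step; the other three mirror the MCE operations of Deletion, Modification, and Annotation.

```python
import copy


def generate_triangle():
    """MCG-style step (hypothetical): build a figure description from scratch."""
    return [
        {"id": "AB", "kind": "segment", "points": [(0, 0), (4, 0)], "color": "black"},
        {"id": "BC", "kind": "segment", "points": [(4, 0), (0, 3)], "color": "black"},
        {"id": "CA", "kind": "segment", "points": [(0, 3), (0, 0)], "color": "black"},
    ]


def delete(figure, element_id):
    """Deletion: return a copy of the figure without the named element."""
    return [copy.deepcopy(e) for e in figure if e["id"] != element_id]


def modify(figure, element_id, **changes):
    """Modification: return a copy with properties of one element changed."""
    edited = copy.deepcopy(figure)
    for element in edited:
        if element["id"] == element_id:
            element.update(changes)
    return edited


def annotate(figure, label_id, text, position):
    """Annotation: return a copy with a new text label appended."""
    edited = copy.deepcopy(figure)
    edited.append({"id": label_id, "kind": "label", "text": text, "position": position})
    return edited


figure = generate_triangle()
without_bc = delete(figure, "BC")                      # drop the hypotenuse
recolored = modify(figure, "AB", color="red")          # highlight one side
labeled = annotate(figure, "len_AB", "4", (2, -0.3))   # mark a side length
```

Each edit returns a new figure and leaves the input unchanged, so an edited result can be compared against a reference figure element by element. That is one plausible way to check an intermediate visual operation rather than only a final answer.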

Paper Structure

This paper contains 21 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Comparison of the paradigm of (a) multi-modal input, text-only output and (b) multi-modal input, multi-modal output with four types of visual operations.
  • Figure 2: Illustration of the initial dataset for four visual operations across five visualization types. The dataset includes evaluation instructions, code, and images constructed for different tasks and visualization types. Please refer to the dataset section for details.
  • Figure 3: Distribution of five visualization types and their major context domains.
  • Figure 4: Basic statistics of MathOPEval.
  • Figure 5: Comparison between reasoning-enhanced and general-purpose models on multiple-choice format using Direct prompt.
  • ...and 10 more figures