MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu
TL;DR
MathOPEval introduces a fine-grained benchmark to evaluate MLLMs on visual operations expressed through code during mathematical reasoning. By separating Multi-modal Code Generation (MCG) and Multi-modal Code Editing (MCE) across five visualization types, the dataset enables precise assessment of intermediate visual operations, not just final answers. Experiments with nine MLLMs reveal substantial gaps to human performance, especially in MCE and function plots, and show that prompt strategy and model scale yield inconsistent gains. The work demonstrates the potential and limits of current MLLMs in visual reasoning and provides a concrete benchmark and evaluation framework to drive future improvements in visual perception and multi-modal programming capabilities.
Abstract
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving MLLMs' ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification, and Annotation. To evaluate these tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
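To make the three MCE operation types concrete, the following is a minimal illustrative sketch, not the benchmark's actual data format or evaluation code. It assumes a figure can be modeled as a list of named drawing elements; the `Element` class and the `delete`, `modify`, and `annotate` helpers are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical drawing primitive; not taken from the benchmark."""
    name: str
    kind: str                    # e.g. "line", "point", "bar"
    color: str = "black"
    label: Optional[str] = None  # text attached to the element, if any


def delete(figure: List[Element], name: str) -> List[Element]:
    """Deletion: drop the element with the given name."""
    return [e for e in figure if e.name != name]


def modify(figure: List[Element], name: str, **changes) -> List[Element]:
    """Modification: change attributes of one existing element."""
    return [replace(e, **changes) if e.name == name else e for e in figure]


def annotate(figure: List[Element], name: str, label: str) -> List[Element]:
    """Annotation: attach a text label to one existing element."""
    return modify(figure, name, label=label)


# A toy geometric diagram: two sides plus an auxiliary line.
figure = [
    Element("AB", "line"),
    Element("BC", "line"),
    Element("aux", "line", color="gray"),
]

# Apply one operation of each type; each call returns a new list.
edited = delete(figure, "aux")
edited = modify(edited, "AB", color="red")
edited = annotate(edited, "BC", "5 cm")
```

Under this simplification, a code-editing task amounts to producing the edited program from the original figure and a textual instruction, and an intermediate visual operation can be checked directly rather than inferred from a final answer.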
