MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Felix Chen, Hangjie Yuan, Yunqiu Xu, Tao Feng, Jun Cen, Pengwei Liu, Zeying Huang, Yi Yang
TL;DR
This work targets the bottleneck in visual mathematical problem-solving caused by imperfect diagram perception in multimodal LLMs. It introduces FlowVerse, a benchmark that decomposes problem information into $DI$, $EI$, $RP$, and $OQ$ across six variants to separately evaluate perception and reasoning, and proposes MathFlow, a modular pipeline that decouples perception from inference. A dedicated perception model, MathFlow-P-7B, is trained via multi-task pretraining on $EI$ and $RP$ caption tasks and refined with supervised fine-tuning, enabling flexible integration with various inference models. Empirical results show substantial performance gains when MathFlow is paired with different backbones and demonstrate the robustness of FlowVerse-CoT-E as an evaluation strategy, highlighting the critical role of accurate perceptual extraction for reliable visual mathematical reasoning. Overall, FlowVerse and MathFlow advance practical visual mathematics by separating perception and inference and enabling state-of-the-art reasoning through improved visual grounding.
Abstract
Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by how humans typically solve such problems, we hypothesize that the perceptual capability to extract meaningful information from diagrams is crucial, as it directly impacts the subsequent inference process. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations both in extracting essential information and reasoned properties from diagrams and in performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
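To make the decoupling concrete, the following is a minimal sketch of a two-stage pipeline in which a perception stage turns the diagram into text and an inference stage sees only that text plus the question. It is an illustration under stated assumptions, not the released MathFlow code: the names `Perception`, `build_prompt`, and `solve`, the callable signatures, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Perception:
    """Text produced by the perception stage (hypothetical container)."""
    essential_information: str  # EI extracted from the diagram
    reasoned_property: str      # RP extracted from the diagram


# Hypothetical stage signatures: any perception or inference model
# could be wrapped behind these callables.
PerceptionModel = Callable[[bytes], Perception]
InferenceModel = Callable[[str], str]


def build_prompt(perception: Perception, question: str) -> str:
    """Serialize the perception output and the question into one text prompt.

    The layout below is an assumption made for illustration.
    """
    return (
        f"Essential information: {perception.essential_information}\n"
        f"Reasoned property: {perception.reasoned_property}\n"
        f"Question: {question}"
    )


def solve(image: bytes, question: str,
          perceive: PerceptionModel, infer: InferenceModel) -> str:
    """Run perception first, then pass only text to the inference stage."""
    perception = perceive(image)
    return infer(build_prompt(perception, question))
```

Because the inference stage receives plain text, the same perception component can in principle be paired with different inference models by swapping the `infer` callable, which mirrors the compatibility claim made in the abstract.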
