MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Felix Chen, Hangjie Yuan, Yunqiu Xu, Tao Feng, Jun Cen, Pengwei Liu, Zeying Huang, Yi Yang

TL;DR

This work targets the bottleneck in visual mathematical problem-solving caused by imperfect diagram perception in multimodal LLMs. It introduces FlowVerse, a benchmark that decomposes problem information into four components, Descriptive Information ($DI$), Essential Information ($EI$), Reasoned Property ($RP$), and Only Question ($OQ$), and combines them into six problem versions so that perception and reasoning can be evaluated separately. It also proposes MathFlow, a modular pipeline that decouples perception from inference. A dedicated perception model, MathFlow-P-7B, is trained via multi-task pretraining on $EI$ and $RP$ caption tasks and refined with supervised fine-tuning, so it can be paired flexibly with various inference models. Empirical results show substantial performance gains when MathFlow is paired with different backbones and demonstrate the robustness of FlowVerse-CoT-E as an evaluation strategy, highlighting the critical role of accurate perceptual extraction for reliable visual mathematical reasoning. Overall, FlowVerse and MathFlow advance practical visual mathematics by separating perception and inference and enabling state-of-the-art reasoning through improved visual grounding.

Abstract

Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by the typical human problem-solving process, we hypothesize that the perception capability to extract meaningful information from diagrams is crucial, as it directly impacts the subsequent inference process. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations both in extracting essential information and reasoned properties from diagrams and in performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.

Paper Structure

This paper contains 25 sections, 1 equation, 19 figures, 13 tables.

Figures (19)

  • Figure 1: The Typical Process of Humans Solving Visual Mathematical Problems. Two key capabilities can be observed in the typical human problem-solving process: perception and inference. Perception involves extracting relevant information from both visual and textual inputs so that the subsequent reasoning is accurate. This observation inspired the development of FlowVerse and MathFlow.
  • Figure 2: Six Versions of Problems in FlowVerse. FlowVerse begins by categorizing the original problem information into four distinct components: Descriptive Information (DI), Essential Information (EI), Only Question (OQ), and Reasoned Property (RP). The first three components are derived directly from the original problem statement, while RP is extracted from the solution and represents the inferences needed to solve the problem. In the Vision Centric version, we convert the EI into diagrams, while in the Vision Primary version, we convert both the EI and RP into diagrams.
  • Figure 3: Overview of the MathFlow Pipeline. To effectively train MLLMs for problem-solving, we decouple an MLLM into two sub-modules: the perception model and the inference model. The perception model is responsible for extracting and interpreting visual information, converting it into a form that can be effectively processed. The inference model uses this extracted information, along with the original question, to reason and derive solutions.
  • Figure 4: The FlowVerse-CoT-E Strategy.
  • Figure 5: Comparison of Two Different CoT Evaluation Performances on FlowVerse$^{\dagger}$.
  • ...and 14 more figures
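
The two-stage flow described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Perception`, `build_prompt`, `solve`, `stub_perceive`, and `stub_infer` are hypothetical, the prompt layout is invented for the example, and the two models are replaced by stub functions so the control flow can be run without any model weights. It assumes the perception stage returns Essential Information (EI) and Reasoned Property (RP) as text, in line with the caption tasks mentioned in the TL;DR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Perception:
    """Text extracted from a diagram (hypothetical container)."""
    essential_information: str  # EI: facts read directly off the diagram
    reasoned_property: str      # RP: properties inferred from the diagram


# Hypothetical interfaces: any callable with these shapes can be plugged in.
PerceptionModel = Callable[[bytes], Perception]
InferenceModel = Callable[[str], str]


def build_prompt(perception: Perception, question: str) -> str:
    """Combine the extracted information with the original question.

    The prompt layout is an assumption made for this sketch.
    """
    return (
        f"Essential information: {perception.essential_information}\n"
        f"Reasoned property: {perception.reasoned_property}\n"
        f"Question: {question}"
    )


def solve(
    diagram: bytes,
    question: str,
    perceive: PerceptionModel,
    infer: InferenceModel,
) -> str:
    """Run perception first, then inference on text only."""
    perception = perceive(diagram)
    return infer(build_prompt(perception, question))


# Stub models standing in for a real perception model and inference model.
def stub_perceive(diagram: bytes) -> Perception:
    return Perception(
        essential_information="AB = 3, BC = 4, angle ABC = 90 degrees",
        reasoned_property="Triangle ABC is a right triangle",
    )


def stub_infer(prompt: str) -> str:
    return "5" if "right triangle" in prompt else "unknown"


answer = solve(b"", "What is the length of AC?", stub_perceive, stub_infer)
print(answer)
```

Because the inference stage only sees text, the inference callable can be swapped for a different backbone without touching the perception stage, which is the kind of modularity the pipeline description emphasizes.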