Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs

Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton van den Hengel, Yuan Xue

TL;DR

This work identifies fine-grained geometric perception as a core bottleneck in multimodal mathematical reasoning and introduces SVE-Math, a geometry-aware framework that plugs a GeoGLIP visual encoder and a dynamic feature router into existing MLLMs. By grounding geometric primitives and selectively integrating visual cues into prompts, the method improves reasoning without requiring massive visual instruction datasets. Empirical results on MathVerse, GeoQA, and MathVista show that SVE-Math achieves strong performance with smaller visual-data budgets and is comparable to strong models such as GPT-4V on MathVista, highlighting the practical impact of improved visual grounding for MLLMs in mathematics.

Abstract

Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. This limitation is largely attributable to inadequate perception of geometric primitives during image-level contrastive pre-training (e.g., CLIP). While recent efforts to improve math MLLMs have focused on scaling up mathematical visual instruction datasets and employing stronger LLM backbones, they often overlook persistent errors in visual recognition. In this paper, we systematically evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding errors and problem-solving performance, underscoring the critical role of fine-grained visual understanding. Notably, advanced models like GPT-4o exhibit a 70% error rate when identifying geometric entities, showing that this remains a key bottleneck in visual mathematical reasoning. To address this, we propose a novel approach, SVE-Math (Selective Vision-Enhanced Mathematical MLLM), featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts tailored to the language model's reasoning needs. In experiments, SVE-Math-Qwen2.5-7B outperforms other 7B models by 15% on MathVerse and is comparable to GPT-4V on MathVista. Despite being trained on smaller datasets, SVE-Math-7B achieves competitive performance on GeoQA, rivaling models trained on significantly larger datasets. Our findings emphasize the importance of incorporating fine-grained visual understanding into MLLMs and provide a promising direction for future research.
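
To make the idea of a feature router concrete, the following is a minimal sketch of one way to weight hierarchical features before forming a visual prompt. It is not the paper's implementation: the softmax gating, the function names `softmax` and `route_features`, the pooled 4-dimensional toy vectors, and the gate scores are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_features(levels: Sequence[Sequence[float]],
                   gate_scores: Sequence[float]) -> List[float]:
    """Blend same-sized feature vectors, one per pyramid level.

    levels      -- hypothetical pooled features, one vector per level
    gate_scores -- hypothetical router logits, one per level
    """
    if len(levels) != len(gate_scores):
        raise ValueError("need exactly one gate score per feature level")
    dim = len(levels[0])
    if any(len(v) != dim for v in levels):
        raise ValueError("all levels must share one feature dimension")
    # Turn the logits into non-negative weights that sum to 1.
    weights = softmax(gate_scores)
    # Weighted sum across levels, computed per feature dimension.
    return [sum(w * v[i] for w, v in zip(weights, levels)) for i in range(dim)]


# Toy input (assumed): three pyramid levels, each pooled to a 4-d vector.
pyramid = [
    [1.0, 0.0, 0.0, 0.0],  # geometry-rich level
    [0.0, 1.0, 0.0, 0.0],  # intermediate level
    [0.0, 0.0, 1.0, 0.0],  # semantic-rich level
]
soft_prompt = route_features(pyramid, gate_scores=[2.0, 0.5, 0.1])
```

In this sketch the level with the largest gate score contributes most to `soft_prompt`. In a trained model the scores would presumably be predicted from the input rather than fixed by hand.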

Paper Structure (21 sections, 7 equations, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Analysis of MLLMs' performance in mathematical visual reasoning tasks from the GeoQA test set. GPT-4o misperceived visual information in approximately 70% of cases involving geometric entities (Fig. \ref{fig:intro_subfig1}). Providing optimal geometric information enhances model performance, while redundant visual cues lower top-1 accuracy, even below the baseline achieved with only textual questions (Fig. \ref{fig:intro_subfig3}). Model performance is sensitive to the accuracy of visual cues: a significant decrease (13.6%) in GPT-4o's top-1 accuracy is observed when it is provided with inaccurate bounding box locations and shape names (Bbox+Shape) (Fig. \ref{fig:intro_subfig2}).
  • Figure 2: The diagram presents the architecture of SVE-Math, highlighting key innovations in the geometric-grounded vision encoder (GeoGLIP) and the feature router. Fine-grained visual understanding is achieved through a feature pyramid (attention maps displayed on the left), capturing hierarchical visual features ranging from geometry-rich to semantic-rich information. The feature router dynamically adjusts the contribution of these features to generate visual soft prompts. These prompts are then combined with CLIP visual tokens and textual inputs before being fed into the language model (LLM), enabling accurate visual perception and enhanced mathematical reasoning.
  • Figure 3: Process for generating synthetic data with box- and pixel-level annotations for training our GeoGLIP visual encoder. Each image contains geometric objects such as circles, rectangles, and alphanumeric text ('Text'), with random strings of length 1 to 10 placed alongside geometric shapes. Refer to Fig. \ref{supp:prorgam} in the Appendix for a detailed flowchart of the generation pipeline.
  • Figure 4: Comparison of geometric numerical answer accuracies (%) on GeoQA.
  • Figure 5: Comparison of model performance on FunctionQA of MathVista.
  • ...and 12 more figures