MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Jie Tang

TL;DR

This work addresses the limited use of visual information in mathematical reasoning by introducing MathVL, a diverse fine-tuning dataset combining open-source math datasets with a large Chinese K12 subset, and by developing MathGLM-Vision, a family of multi-modal mathematical LLMs fine-tuned on MathVL. Through SFT on backbone models, the authors achieve substantial gains across public benchmarks (MathVista, MathVerse, Math-Vision) and a curated MathVL-test, with the 32B variant often outperforming strong baselines including GPT-4V on challenging tasks. The results demonstrate the value of dataset diversity—especially Chinese data and VQA diversity—for improving both domain-specific reasoning and general vision-language capabilities. The work highlights practical implications for complex visual-mathematical problem solving in education and research settings, and points to future improvements in reasoning and perception components.

Abstract

Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.

Paper Structure (22 sections, 19 figures, 15 tables)

Figures (19)

  • Figure 1: Insight experiments demonstrate the significance of visual information in solving mathematical problems. (Left) A performance comparison of different models with and without visual inputs. (Right) The accuracies of MathGLM-Vision on MathVL-test with and without visual inputs.
  • Figure 2: Performance comparison of different multi-modal large language models. (Left) The accuracies of MathGLM-Vision and other MLLMs across three evaluation datasets. (Right) The accuracy of MathGLM-Vision and other MLLMs on MathVL-test across different categories.
  • Figure 3: Analysis of answer lengths in several open-source mathematical datasets like MathV360K, GeoGPT4V, and Geometry3K.
  • Figure 4: Examples sampled from the constructed Chinese dataset.
  • Figure 5: Error distribution of MathGLM-Vision-32B.
  • ...and 14 more figures