GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, Jing Zhang

TL;DR

This work tackles solid geometry reasoning in vision-language models, where auxiliary-line constructions are often essential but hard for image editing models to render with geometric precision. It introduces a two-stage training framework: supervised fine-tuning (SFT) on chain-of-thought data, followed by GRPO-based reinforcement learning guided by a geometry-aware cross-modal reward that compares textual auxiliary-line descriptions against ground-truth auxiliary-line diagrams. Training on the AuxSolidMath dataset yields GeoVLMath. At the 3B and 7B scales, the model is competitive with, and often better than, substantially larger open-source and closed-source LVLMs on auxiliary-line reasoning benchmarks, which suggests that geometry-grounded supervision can offset a gap in model scale. By providing AuxSolidMath and a reusable cross-modal RL paradigm for diagram-text alignment, the work offers a practical route toward more robust geometry reasoning in LVLMs, with room for future rendering-based enhancement of auxiliary lines.

Abstract

Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
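The abstract names the two ingredients of the RL stage (a cross-modal consistency score and GRPO) without giving formulas on this page. The following is a minimal sketch of how such a signal could be assembled, not the authors' implementation. It assumes the consistency score lies in [0, 1], that it is blended linearly with a 0/1 final-answer reward, and that advantages are normalized within a group of sampled responses in the usual GRPO manner. The function names, the `weight` mixing coefficient, and the `eps` stabilizer are hypothetical.

```python
from statistics import mean, pstdev


def composite_reward(consistency: float, answer_correct: bool, weight: float = 0.5) -> float:
    """Blend a cross-modal consistency score with a final-answer reward.

    `consistency` is assumed to be in [0, 1] (e.g. a reward model's score for
    how well the auxiliary-line description matches the ground-truth diagram).
    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must be in [0, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    accuracy = 1.0 if answer_correct else 0.0
    return weight * consistency + (1.0 - weight) * accuracy


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same problem.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation here is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical sampled responses: (consistency score, answer correct?)
    samples = [(0.9, True), (0.4, True), (0.7, False), (0.1, False)]
    rewards = [composite_reward(c, ok) for c, ok in samples]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.2f}")
```

Under these assumptions, a response with a correct answer but a poorly matched auxiliary-line description earns less than one that gets both right, so the policy is pushed toward constructions that agree with the diagram rather than toward answers alone.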

Paper Structure

This paper contains 35 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Pass@1 of Qwen2.5-VL-72B-Instruct and Gemini-2.5-Flash. "Aux" denotes auxiliary-line description.
  • Figure 2: Overview of the cross-modal reward-driven RL. We first fine-tune a cross-modal reward model on curated high-quality data to evaluate the correctness of auxiliary-line constructions. During the RL phase, the reward model’s consistency score is combined with a final-answer accuracy reward to produce a composite signal that updates the policy via GRPO.
  • Figure 3: Overview of the Proposed Data Creation Pipeline.
  • Figure 4: An Example from the AuxSolidMath Dataset.
  • Figure 5: Comparison of two representative image editing models for constructing a three-dimensional Cartesian coordinate system.
  • ...and 4 more figures