GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, Jing Zhang
TL;DR
This work addresses solid geometry reasoning in vision-language models, where auxiliary-line constructions are often essential but hard to render precisely. It introduces a two-stage training framework: supervised fine-tuning (SFT) on chain-of-thought data, followed by GRPO-based reinforcement learning guided by a geometry-aware cross-modal reward that scores each textual auxiliary-line description against the ground-truth auxiliary-line diagram. Trained on the AuxSolidMath dataset, the resulting model, GeoVLMath, is competitive with substantially larger open-source and closed-source LVLMs on auxiliary-line reasoning benchmarks, suggesting that geometry-grounded supervision can compensate for model scale. By providing AuxSolidMath and a reusable cross-modal RL paradigm for diagram-text alignment, the work offers a practical route toward more robust geometry reasoning in LVLMs and leaves room for future rendering-based enhancements of auxiliary lines.
Abstract
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image-editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
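To make the role of the reward in a GRPO-based stage concrete, the following is a minimal sketch of the group-relative advantage computation commonly used in GRPO, not the authors' implementation. It assumes that each sampled response to a problem has already been assigned a scalar cross-modal reward; the function name, the reward values, and the use of the population standard deviation with a small epsilon are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses in its group.

    `rewards` holds one scalar per sampled response to the same problem.
    The population standard deviation and `eps` are assumptions made for
    this sketch; implementations differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical cross-modal reward scores for four sampled descriptions
# of the auxiliary lines for one diagram (placeholder values).
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Responses whose descriptions match the ground-truth auxiliary-line diagram better than the group average receive a positive advantage and are reinforced, while those below average are penalized; a group with identical rewards yields zero advantage for every response.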
