CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
TL;DR
This work addresses the gap in multimodal geometric reasoning by introducing CapGeo, a caption-assisted framework that converts geometric diagrams into dense textual captions to improve reasoning in MLLMs. It pairs CapGeo with CapGeo-Bench, a 4,641-image geometry captioning dataset with a keypoint-based evaluation method whose scores correlate with downstream reasoning performance. Experiments across MathVerse, MathVista, and GeoQA show substantial gains when captions accompany the visual input: open-source models benefit notably and, with captions, approach the performance of large closed-source models. Together, CapGeo and CapGeo-Bench provide a practical pathway to robust geometric reasoning and a rigorous standard for captioning quality in geometric contexts.
Abstract
Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
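As a rough illustration of the caption-assisted idea described above, the following minimal sketch assembles the text part of a query with and without a diagram caption. The `GeometryProblem` class, the `build_prompt` function, the prompt wording, and the example problem are all hypothetical and are not taken from the paper; the image itself would be passed to the model separately, and the captioning model is not shown.

```python
from dataclasses import dataclass


@dataclass
class GeometryProblem:
    """Hypothetical container for one geometry question (not from the paper)."""
    question: str
    image_path: str      # diagram passed to the MLLM alongside the text
    caption: str = ""    # textual description of the diagram, if available


def build_prompt(problem: GeometryProblem, use_caption: bool = True) -> str:
    """Build the text part of the query.

    With use_caption=False this is the vision-only setting: the model sees
    only the question text plus the image. With use_caption=True the caption
    is prepended so the model can reason over a textual description too.
    """
    parts = []
    if use_caption and problem.caption.strip():
        parts.append("Diagram description:\n" + problem.caption.strip())
    parts.append("Question:\n" + problem.question.strip())
    return "\n\n".join(parts)


# Illustrative example; the problem and caption are made up.
example = GeometryProblem(
    question="Find the measure of angle ACB.",
    image_path="triangle.png",
    caption="Triangle ABC with AB = AC. Angle BAC measures 40 degrees.",
)

vision_only_prompt = build_prompt(example, use_caption=False)
caption_assisted_prompt = build_prompt(example, use_caption=True)
```

The only difference between the two settings in this sketch is the extra caption text, which mirrors the comparison reported in the abstract (vision-only versus caption-equipped).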
