CapGeo: A Caption-Assisted Approach to Geometric Reasoning

Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang

TL;DR

This work tackles the gap in multimodal geometric reasoning by introducing CapGeo, a caption-assisted framework that converts geometric diagrams into dense textual captions to improve reasoning in MLLMs. It pairs CapGeo with CapGeo-Bench, a geometry captioning dataset of 4,641 figure-caption pairs with a keypoint-based evaluation method whose scores correlate with downstream reasoning performance. Experiments across MathVerse, MathVista, and GeoQA show substantial gains when captions accompany visual input, with open-source models benefiting notably and caption-assisted open-source models approaching the performance of large closed models. Together, CapGeo and CapGeo-Bench provide a practical pathway to robust geometric reasoning and a rigorous standard for captioning quality in geometric contexts.

Abstract

Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
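
The caption-assisted flow described in the abstract can be pictured with a minimal sketch. It assumes a two-stage setup in which one model captions the diagram and a second model answers the question from the image plus that caption. The function names, the prompt wording, and the `captioner` and `reasoner` callables are hypothetical placeholders for illustration, not the authors' implementation.

```python
from typing import Callable


def build_caption_assisted_prompt(question: str, caption: str) -> str:
    """Place the diagram caption ahead of the problem text in one prompt."""
    # The prompt wording below is an assumption made for this sketch.
    return (
        "Diagram description:\n"
        f"{caption.strip()}\n\n"
        "Problem:\n"
        f"{question.strip()}\n\n"
        "Solve the problem using both the image and the description."
    )


def solve_with_caption(
    image: bytes,
    question: str,
    captioner: Callable[[bytes], str],      # hypothetical: image -> caption
    reasoner: Callable[[bytes, str], str],  # hypothetical: (image, prompt) -> answer
) -> str:
    """Caption the diagram first, then reason over image plus caption."""
    caption = captioner(image)
    prompt = build_caption_assisted_prompt(question, caption)
    return reasoner(image, prompt)
```

Passing the image to the reasoner together with the caption mirrors the setting in which captions accompany visual input; a caption-only variant would simply drop the image argument from the second call.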

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Bad Cases in MathVerse. The upper part shows incorrect results from MLLMs. "Mismatch" means that the generated relationship exists but is assigned to the wrong subjects. With caption assistance, GPT-o3 reasons correctly.
  • Figure 1: CapGeo-Bench Statistics
  • Figure 2: Overview of CapGeo-Bench. AG: Analytic Geometry, PG: Plane Geometry, SG: Solid Geometry
  • Figure 3: CapGeo-Bench Data Statistics Chart
  • Figure 4: Overview of CapGeo-Bench Evaluation. Covered items in keypoints are marked by italics and underlining.
  • ...and 5 more figures
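
The keypoint-based evaluation mentioned for Figure 4 can be read as a coverage score: the fraction of reference keypoints that a generated caption covers. The sketch below is an illustration under that assumption only. The normalized substring matcher is a simple stand-in, since the exact matching procedure used by CapGeo-Bench is not described in this summary, and the names `normalize` and `keypoint_coverage` are hypothetical.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def keypoint_coverage(caption: str, keypoints: Sequence[str]) -> float:
    """Return the fraction of keypoints found in the caption.

    Matching is a whole-token substring test on normalized text, which is
    an assumption of this sketch rather than the benchmark's actual rule.
    """
    if not keypoints:
        return 0.0
    padded_caption = f" {normalize(caption)} "
    covered = sum(
        1 for kp in keypoints if f" {normalize(kp)} " in padded_caption
    )
    return covered / len(keypoints)


caption = "In triangle ABC, AB = AC, and D is the midpoint of BC."
keypoints = [
    "triangle ABC",
    "AB = AC",
    "D is the midpoint of BC",
    "angle BAC = 40",
]
print(keypoint_coverage(caption, keypoints))  # 3 of 4 keypoints covered: 0.75
```

A string matcher like this misses paraphrases such as "AC equals AB", so a practical evaluator would need a more tolerant comparison for each keypoint.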