GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng

TL;DR

GeoSense tackles geometry problem-solving by introducing a bilingual benchmark that explicitly evaluates both identification and application of geometric principles in multimodal contexts. The framework reorganizes knowledge into a 148-principle, five-level hierarchy spanning plane and solid geometry, paired with a finely annotated dataset of $1{,}789$ problems and $5{,}556$ principle–diagram alignments. It defines two novel metrics, Geometry Principle Identification (GPI) and Geometry Principle Application (GPA), plus a final accuracy measure to comprehensively assess GPS reasoning in MLLMs. Comprehensive experiments show that while some models excel at computation, identifying and correctly applying geometric principles—especially in plane geometry—remains a bottleneck, underscoring GeoSense’s utility for diagnosing and guiding future improvements in human-like geometric reasoning. The work highlights actionable insights for enhancing MLLMs’ diagram-grounded reasoning in GPS tasks and establishes a benchmark with clear, interpretable signals for progress.

Abstract

Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.

Paper Structure

This paper contains 30 sections, 3 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Humans solve geometric problems by first identifying the relevant geometric principles and then applying them to derive solutions.
  • Figure 2: MLLMs encounter failures in GPS: Qwen2-VL-7B fails to identify the correct principle and GPT-4o struggles to apply principles to solve questions.
  • Figure 3: Diagram of the top-3 levels of geometric principles (5 levels in total). See details in Appendix 4.1.
  • Figure 4: Illustration of the GeoSense evaluation strategy. MLLMs are assessed on three aspects: identification (i.e., GPI) and application (i.e., GPA) of geometric principles, and final answer accuracy.
  • Figure 5: Performance of (a) closed-source and (b) open-source MLLMs on problems with different numbers of geometric principles.
  • ...and 8 more figures