GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving

Linger Deng, Yuliang Liu, Wenwen Yu, Zujia Zhang, Jianzhong Ju, Zhenbo Luo, Xiang Bai

TL;DR

GeoFocus tackles geometry problem-solving (GPS) with two complementary perception modules: a Critical Local Perceptor that foregrounds critical local cues through thirteen theory-based perception Q&A templates, and VertexLang, a compact topology language for efficient global topology reconstruction. The two-stage framework reports an average accuracy improvement of 4.7% over leading specialized models and greater robustness under diverse visual conditions, while cutting global perception training time by 20% and expanding critical local feature coverage by 61%. Experiments on Geo3K, GeoQA, FormalGeo7K, and out-of-domain benchmarks show both in-domain gains and robust generalization, supported by ablations and analyses of training order that favor a global-to-local learning sequence. Overall, GeoFocus offers a scalable, data-efficient pathway to fuse global topology understanding with rich local geometric reasoning, with potential to integrate with external theorem libraries and extend to 3D geometry.

Abstract

Geometry problem-solving remains a significant challenge for Large Multimodal Models (LMMs), requiring not only global shape recognition but also attention to intricate local relationships related to geometric theory. To address this, we propose GeoFocus, a novel framework comprising two core modules. 1) Critical Local Perceptor automatically identifies and emphasizes critical local structures (e.g., angles, parallel lines, comparative distances) through thirteen theory-based perception templates, boosting critical local feature coverage by 61% compared to previous methods. 2) VertexLang, a compact formal language for topology, encodes global figures through vertex coordinates and connectivity relations. By replacing bulky code-based encodings, VertexLang reduces global perception training time by 20% while improving topology recognition accuracy. When evaluated on Geo3K, GeoQA, and FormalGeo7K, GeoFocus achieves a 4.7% accuracy improvement over leading specialized models and demonstrates superior robustness on MATHVERSE under diverse visual conditions. Project page: https://github.com/dle666/GeoFocus
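
To make the idea of encoding a figure "through vertex coordinates and connectivity relations" concrete, here is a minimal Python sketch. It is not the actual VertexLang syntax, which this summary does not specify. The `V:`/`E:` line format and the names `encode_figure` and `decode_figure` are assumptions invented purely for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Edge = Tuple[str, str]


def encode_figure(vertices: Dict[str, Point], edges: List[Edge]) -> str:
    """Serialize a figure as one vertex line and one edge line.

    Hypothetical format, NOT the real VertexLang grammar.
    The ':g' format drops trailing zeros (and keeps ~6 significant digits).
    """
    vertex_part = "; ".join(
        f"{name}({x:g}, {y:g})" for name, (x, y) in vertices.items()
    )
    edge_part = "; ".join(f"{a}-{b}" for a, b in edges)
    return f"V: {vertex_part}\nE: {edge_part}"


def decode_figure(text: str) -> Tuple[Dict[str, Point], List[Edge]]:
    """Parse the hypothetical format back into vertices and edges."""
    vertex_line, edge_line = text.split("\n")
    vertices: Dict[str, Point] = {}
    for item in vertex_line[len("V: "):].split("; "):
        name, coords = item.rstrip(")").split("(")
        x, y = coords.split(", ")
        vertices[name] = (float(x), float(y))
    edges: List[Edge] = []
    for item in edge_line[len("E: "):].split("; "):
        a, b = item.split("-")
        edges.append((a, b))
    return vertices, edges


# A right triangle: three named vertices plus the segments joining them.
triangle_vertices = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0)}
triangle_edges = [("A", "B"), ("B", "C"), ("C", "A")]

encoded = encode_figure(triangle_vertices, triangle_edges)
print(encoded)
```

The point of the sketch is only that a whole figure's topology can fit in a couple of short lines of text, in contrast to a full plotting script that draws the same figure; the compactness and accuracy claims above are the paper's, not something this toy format demonstrates.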

Paper Structure

This paper contains 31 sections, 19 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: The two existing paradigms for enhancing geometry perception of LMMs: image reconstruction and perception Q&A data generation.
  • Figure 2: Local structure with theoretical applicability is the key to geometry problem solving.
  • Figure 3: Existing methods for improving GPS ability through visual perception, including formal language-based methods (a, b), image reconstruction-based approaches (c, d), and perception Q&A-based approaches (e, f). We propose GeoFocus (g), which enhances visual perception through global topology reconstruction and local perception Q&A training. 'Percep.' is short for Perception.
  • Figure 4: Overview of GeoFocus. Critical Local Perceptor focuses on the critical local structures required for GPS through perception Q&A training, laying the foundation for accurate reasoning. VertexLang Topology Perceptor improves the model's understanding of topology structures through VertexLang-based image reconstruction training. 'Percep.' is short for Perception.
  • Figure 5: Critical local structure type count percentage distribution in the classic GPS Q&A pairs.
  • ...and 9 more figures