GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation

Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, Hanmeng Liu

TL;DR

This work tackles Geometric Problem Solving (GPS) with multimodal inputs by introducing GeoSketch, a neural-symbolic framework that casts geometry reasoning as an interactive perception–reasoning–action loop. It blends a perception module that abstracts diagrams into logic forms, a neural-symbolic reasoning module that applies theorems to select executable sketch actions, and a sketch-action module that manipulates the diagram via auxiliary lines and affine transformations. The authors contribute a three-tier architecture, a 390-problem GeoSketch Benchmark, and a two-stage training pipeline (SFT with knowledge distillation followed by reinforcement learning with symbolic rewards) that yields state-of-the-art performance on the benchmark, even surpassing larger models in some settings. The approach also improves general geometric knowledge, showing gains on static geometry benchmarks, and offers efficiency advantages over code-generation-based methods, establishing a new foundation for dynamic, verifiable visuospatial reasoning in AI.

Abstract

Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. Existing approaches process diagrams as static images and therefore lack the capacity for dynamic manipulation, a core aspect of human geometric reasoning that involves auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolically curated trajectories, followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
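
To make the two kinds of sketch action named in the abstract concrete, the following is a minimal illustration of an auxiliary line (a median) and an affine transformation (a rotation about a point). It is not taken from the paper: the triangle coordinates and the helper names `midpoint`, `rotation_about`, and `apply_affine` are assumptions chosen for the example, and the construction shown is the standard textbook "double the median" trick rather than anything GeoSketch is confirmed to use.

```python
import math

def midpoint(p, q):
    """Midpoint of segment pq; the endpoint of a median used as an auxiliary line."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def rotation_about(cx, cy, theta):
    """3x3 homogeneous matrix for a rotation by theta (radians) about (cx, cy).

    Derived from p' = R(p - c) + c, so the translation part is c - R c.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c, -s, cx - c * cx + s * cy],
        [s,  c, cy - s * cx - c * cy],
        [0.0, 0.0, 1.0],
    ]

def apply_affine(m, point):
    """Apply a 3x3 homogeneous affine matrix to a 2D point."""
    x, y = point
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )

# Hypothetical triangle (coordinates are illustrative only).
A, B, C = (0.0, 0.0), (4.0, 0.0), (0.0, 3.0)

# Auxiliary line: the median from A to the midpoint M of BC.
M = midpoint(B, C)

# Affine transformation: rotate the triangle by 180 degrees about M.
rot = rotation_about(M[0], M[1], math.pi)
A_image = apply_affine(rot, A)
B_image = apply_affine(rot, B)
```

Under these assumptions the half-turn about M swaps B and C and sends A to the fourth vertex of a parallelogram, which is the kind of diagram update a static-image approach cannot express.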

Paper Structure

This paper contains 31 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A visual geometric problem that requires dynamic image manipulation.
  • Figure 2: The three-tier GeoSketch architecture. The perception module first extracts an initial logic form and then refines it until it is complete. The neural-symbolic reasoning module then tries to solve the question by drawing auxiliary lines, and the sketch-action module manipulates the image.
  • Figure 3: The auto-correction mechanism in image generation.
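
The flow described in the Figure 2 caption (perceive, reason, act, repeat) can be sketched as a toy control loop. Everything below is a hypothetical illustration, not the authors' implementation: the `DiagramState` class, the string-encoded predicates such as `Midpoint(M,BC)`, and the single hard-coded rule in `reason` are assumptions made only to show the closed-loop structure.

```python
from dataclasses import dataclass, field

@dataclass
class DiagramState:
    """Toy stand-in for a diagram: logic-form predicates plus executed actions."""
    facts: set = field(default_factory=set)
    actions: list = field(default_factory=list)

def perceive(state):
    """Perception step: abstract the current diagram into a logic form."""
    return frozenset(state.facts)

def reason(logic_form, goal):
    """Reasoning step: choose the next sketch action, or None if none applies.

    Hypothetical rule: if M is the midpoint of BC and the median AM is
    not drawn yet, propose drawing it as an auxiliary line.
    """
    if goal in logic_form:
        return None
    if "Midpoint(M,BC)" in logic_form and "Line(A,M)" not in logic_form:
        return ("draw_line", "A", "M")
    return None

def act(state, action):
    """Sketch-action step: execute the action and update the diagram."""
    _kind, p, q = action
    state.actions.append(action)
    state.facts.add(f"Line({p},{q})")

def solve(state, goal, max_steps=5):
    """Run the perception-reasoning-action loop until the goal holds or no action applies."""
    for _ in range(max_steps):
        logic_form = perceive(state)
        if goal in logic_form:
            return True
        action = reason(logic_form, goal)
        if action is None:
            return False
        act(state, action)
    return goal in perceive(state)

state = DiagramState(facts={"Triangle(A,B,C)", "Midpoint(M,BC)"})
solved = solve(state, goal="Line(A,M)")
```

The point of the sketch is only the control flow: each action changes the diagram, so the next perception step sees an updated logic form. In the framework as described, the reasoning step is a trained model applying geometric theorems rather than a fixed rule.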