NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

Weiming Wu, Jin Ye, Zi-kang Wang, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

TL;DR

NeSyGeo tackles the data scarcity and misalignment challenges in multimodal geometric reasoning by introducing a neuro-symbolic data-generation framework. It defines a Geo-DSL for plane geometry, paired with a bidirectional conversion pipeline and a two-stage CoT generator (Reasoner and Verifier) to produce valid Q&A and reasoning paths, then maps symbolic outputs to images and text via Painter and Translator with information orthogonality. The approach yields 100k labeled samples across NeSyGeo-Caption and NeSyGeo-CoT, plus a 2,668-sample NeSyGeo-Test benchmark, and demonstrates consistent improvements across MathVision, MathVerse, and GeoQA under RL and SFT, including cases where a 4B model exceeds an 8B sibling on geometric tasks. Overall, NeSyGeo provides high-quality, diverse, and numerically grounded multimodal geometric data that strengthens visual grounding and cross-modal reasoning in MLLMs, with reproducibility and public dataset release enabling broader advancement in geometric reasoning research.

Abstract

Obtaining large-scale, high-quality reasoning data is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-attributes-relations paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to visual and textual representations, and generates reasoning paths with reverse search and forward validation. Based on this framework, we construct the NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark, NeSyGeo-Test, for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.
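
To make the entity-attributes-relations idea concrete, the following is a minimal illustrative sketch in Python. It is not the paper's Geo-DSL syntax or implementation: the class names (`Entity`, `Relation`), the relation label `"perpendicular"`, and the helper functions `derive_target` and `validate` are all hypothetical and chosen only for this example. The sketch encodes a right triangle symbolically, derives an unknown length from the relations, and then re-checks the answer in a separate validation step, loosely mirroring the "generate a reasoning path, then validate it" pattern described above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A geometric object with named attributes (hypothetical schema)."""
    name: str
    kind: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A relation between named entities (hypothetical schema)."""
    kind: str
    args: tuple


# Toy scene: triangle ABC with a right angle at B, legs AB = 3 and BC = 4.
entities = [
    Entity("AB", "segment", {"length": 3.0}),
    Entity("BC", "segment", {"length": 4.0}),
    Entity("AC", "segment"),  # length unknown: this is the question target
]
relations = [Relation("perpendicular", ("AB", "BC"))]


def derive_target(entities, relations):
    """Toy derivation: find a relation that lets us compute the target length."""
    by_name = {e.name: e for e in entities}
    for rel in relations:
        if rel.kind == "perpendicular":
            a, b = (by_name[n].attributes["length"] for n in rel.args)
            return math.hypot(a, b)  # Pythagorean theorem
    raise ValueError("no applicable rule for this scene")


def validate(answer, entities, relations, tol=1e-9):
    """Toy validation: independently check that a^2 + b^2 == c^2 holds."""
    by_name = {e.name: e for e in entities}
    rel = relations[0]
    a, b = (by_name[n].attributes["length"] for n in rel.args)
    return abs(a * a + b * b - answer * answer) <= tol


answer = derive_target(entities, relations)
is_valid = validate(answer, entities, relations)
```

Keeping the scene in a symbolic form like this means the same record can, in principle, drive both an image renderer and a text generator, which is the role the summary above assigns to the Painter and Translator.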

Paper Structure

This paper contains 42 sections, 4 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Performance comparison of different MLLMs and LLMs with and without image input in several geometry datasets. The minimal or negligible drops observed upon image removal in GeoQA and R-CoT raise concerns regarding the utilization of visual information for geometric reasoning.
  • Figure 2: Comparison of dataset characteristics synthesized by our method and other popular synthesis approaches. "High Resolution" denotes average image pixels exceeding 336$\times$336. "Symbolic Form" refers to the symbolic meta-information associated with the image. "Classification of Elements" signifies categorization by geometric elements. "Visual Understanding" represents the mitigation of image-text redundancy for stronger visual grounding in reasoning. More specific examples of the different methods are given in the appendix.
  • Figure 3: Our pipeline is centered around a symbolic language Geo-DSL. The Generator synthesizes a sequence in this language, from which Reasoner and Verifier produce logically sound Q&A pairs and reasoning chains. Subsequently, Painter and Translator render the symbolic core into semantically orthogonal visual and textual outputs.
  • Figure 4: Human evaluation results comparison.
  • Figure 5: Efficiency comparison of our NeSyGeo-CoT dataset versus other mainstream automated synthesis datasets. The models are trained using RL methods with InternVL2.5-4B.
  • ...and 14 more figures