Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng, Wentao Zhang, Binhang Yuan

TL;DR

A pipeline for synthesizing complex multimodal geometry problems from scratch, used to construct a dataset named GeoCode. The pipeline decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images.

Abstract

Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.

Paper Structure (75 sections, 6 equations, 10 figures, 8 tables, 1 algorithm)

Figures (10)

  • Figure 1: Motivation of our work. We address three limitations of current geometry datasets (implicit visual-symbolic alignment, low structural diversity, and low problem difficulty) by synthesizing problems from scratch and supervising models with plotting code for explicit structural grounding.
  • Figure 2: Overview of the data generation pipeline. Our framework factorizes synthesis into three verifiable stages: (1) Seed Generation for symbolic relational structures, (2) Instantiation for numerical grounding and meta-code generation, and (3) Visualization for diagram rendering and textual debiasing.
  • Figure 3: Plotting code as explicit alignment. Instead of relying on lossy linguistic descriptions, we utilize structured plotting code as an intermediate representation to couple visual perception with symbolic reasoning.
  • Figure 4: Exemplary data generation and model inference. The left panel illustrates the end-to-end synthesis process: from Symbolic Seed to Template-based Translation, followed by the Numerical Question, its corresponding Plotting Code, the Debiased Problem statement, and the final rendered Diagram. The right panel showcases a real-world Inference Example, demonstrating the model's ability to perform structured reasoning and code-based grounding during testing.
  • Figure 5: Solving accuracy and segment-level structural recovery quality on Test-mini. Samples are grouped into four equal-sized bins according to segment-level F1 between predicted and ground-truth plotting code.
  • ...and 5 more figures
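
The Figure 5 caption mentions a segment-level F1 between predicted and ground-truth plotting code, with samples grouped into four equal-sized bins. The listing here does not say how segments are extracted from the code or how ties are handled, so the following is only a minimal sketch under stated assumptions: each plotting program is assumed to be already reduced to a list of endpoint-label pairs, a segment is treated as an unordered pair (AB equals BA), and the function names `to_segments`, `segment_f1`, and `equal_sized_bins` are hypothetical rather than taken from the GeoCode repository.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Assumption: a segment is identified only by its two endpoint labels.
Segment = FrozenSet[str]


def to_segments(pairs: Iterable[Tuple[str, str]]) -> Set[Segment]:
    """Normalize endpoint pairs so that AB and BA count as the same segment."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def segment_f1(
    predicted: Iterable[Tuple[str, str]],
    reference: Iterable[Tuple[str, str]],
) -> float:
    """F1 between the predicted and the reference segment sets."""
    pred, ref = to_segments(predicted), to_segments(reference)
    if not pred and not ref:
        return 1.0  # Convention chosen here: two empty drawings match.
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def equal_sized_bins(scores: List[float], n_bins: int = 4) -> List[List[int]]:
    """Sort sample indices by score and split them into n_bins near-equal groups."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        bins.append(order[start:end])
    return bins
```

With per-sample F1 scores in hand, solving accuracy could then be averaged within each bin returned by `equal_sized_bins` to look for a trend between structural recovery and answer correctness. Whether the paper uses rank-based bins like these or some other grouping is not stated in the caption.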