GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs

Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, Yong Wu

TL;DR

GeoGramBench formalizes the Program-to-Geometry task and introduces a dataset of $500$ geometry problems with explicit procedural drawing code, probing LLMs' ability to translate code into geometric representations and reason spatially over them. Across $17$ frontier models, performance is strong on simple primitives but declines sharply for local relations and global abstractions: the best accuracy on global abstraction is around $43.35\%$, and no model exceeds $50\%$. The authors apply a leakage-mitigated data collection procedure and a three-level taxonomy to enable fine-grained analysis, and they show that the choice of drawing language (Asymptote vs. Matplotlib) has negligible impact on performance. They propose a multi-stage account of internal geometric reasoning and highlight the need for architectures and training strategies that improve symbolic-to-geometric abstraction. GeoGramBench thus provides a reusable benchmark for driving progress in symbolic-to-geometric understanding in LLMs.

Abstract

Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: https://github.com/LiAuto-DSR/GeoGramBench.

Paper Structure

This paper contains 37 sections, 4 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Overview and performance analysis on text-only ($\mathbb{P}_T$) and text+code ($\mathbb{P}_{TC}$) geometry problems. (a) The procedural code is wrapped with [asy][/asy] and its geometric figure is visualized to facilitate understanding. (b) and (c) show accuracy comparisons of models on $\mathbb{P}_T$ and $\mathbb{P}_{TC}$ subsets in AIME24 ($|\mathbb{P}_{TC}|=5$, $|\mathbb{P}_T|=25$) and MATH-500 ($|\mathbb{P}_{TC}|=42$, $|\mathbb{P}_T|=458$), respectively. In both benchmarks, accuracy consistently drops for problems with procedural code.
  • Figure 2: Distribution of problem difficulty levels and QwQ-32B accuracy for text-only ($\mathbb{P}_T$) vs. text+code ($\mathbb{P}_{TC}$) geometry problems on MATH-500.
  • Figure 3: Representative examples from GeoGramBench illustrating the three ascending Program-to-Geometry difficulty levels: Primitive Recognition, Local Relation Composition, and Global Abstract Integration. Each category is exemplified by two sampled problems, highlighting the increasing spatial complexity and abstraction across levels.
  • Figure 4: Illustration of two types of answer leakage in procedural code (highlighted in yellow). Left: direct leakage, where the answer is explicitly given by a coordinate value in the Asymptote code (here, we rescale the coordinates to preserve the geometric shape). Right: indirect leakage, where the answer can be computed from code parameters (in this case, we modify the procedural code to mask such critical information).
  • Figure 5: Illustrative solution process generated by the QwQ-32B model on a Local Relation Composition problem. The model initially attempts to construct spatial representations from the provided code, then interprets geometric elements such as direction and region, exhibiting behavior aligned with all three research questions (RQ1–RQ3): local construction, compositional integration, and chain-of-thought-based refinement. Multiple rounds of reflection and verification are observed, although these iterative steps do not consistently yield correct or fully integrated solutions.
  • ...and 7 more figures
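
The split between text-only ($\mathbb{P}_T$) and text+code ($\mathbb{P}_{TC}$) problems in Figure 1 depends on whether a problem embeds procedural code wrapped in [asy][/asy]. The following is a minimal Python sketch of how such a partition and a per-subset accuracy comparison might be computed. It is not the authors' evaluation code: the function names, the `(problem_text, is_correct)` record format, and the toy problems are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches procedural code wrapped as [asy] ... [/asy], as described in Figure 1(a).
ASY_BLOCK = re.compile(r"\[asy\](.*?)\[/asy\]", re.DOTALL)


def extract_asy(problem: str) -> Optional[str]:
    """Return the first embedded Asymptote block, or None for a text-only problem."""
    match = ASY_BLOCK.search(problem)
    return match.group(1).strip() if match else None


def accuracy_by_subset(records):
    """records: iterable of (problem_text, is_correct) pairs (hypothetical format).

    Returns accuracy for the text-only subset "P_T" and the text+code subset "P_TC";
    a subset with no problems maps to None.
    """
    totals = {"P_T": [0, 0], "P_TC": [0, 0]}  # subset -> [num_correct, num_total]
    for problem, is_correct in records:
        subset = "P_TC" if extract_asy(problem) is not None else "P_T"
        totals[subset][0] += int(is_correct)
        totals[subset][1] += 1
    return {name: (c / n if n else None) for name, (c, n) in totals.items()}


if __name__ == "__main__":
    # Toy records invented purely for illustration; not data from the paper.
    toy = [
        ("Find the area. [asy]draw((0,0)--(4,0)--(4,3)--cycle);[/asy]", False),
        ("Find the radius. [asy]draw(circle((0,0),2));[/asy]", True),
        ("What is 2 + 2?", True),
    ]
    print(accuracy_by_subset(toy))
```

Matching on the [asy] tag alone keeps the partition independent of any model output, so the same split can be reused across models when comparing the two subsets.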