GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

Yushun Zhang; Weiping Fu; Zesheng Yang; Bo Zhao; Lingling Zhang; Jian Zhang; Yumeng Fu; Jiaxing Huang; Jun Liu

Abstract

Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often small in scale and rarely provide visually grounded multiple-choice questions, which hinders reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
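To make the exact match figures concrete, the following is a minimal sketch of how such a score could be computed for multi-answer multiple-choice questions. It assumes that a question counts as correct only when the selected set of options equals the gold set, with no partial credit; the function names and the option labels ("A", "B", ...) are illustrative assumptions, not the benchmark's actual scoring script.

```python
from typing import Iterable, Sequence


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Return True only if the predicted option set equals the gold set.

    Assumption: order and duplicates are ignored, and selecting a strict
    subset or superset of the gold options counts as a miss.
    """
    return set(predicted) == set(gold)


def exact_match_score(
    predictions: Sequence[Iterable[str]],
    references: Sequence[Iterable[str]],
) -> float:
    """Percentage of questions whose predicted option set matches exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical example: the first answer matches (order is irrelevant),
# the second misses one gold option and therefore earns no credit.
score = exact_match_score([["A", "C"], ["B"]], [["C", "A"], ["B", "D"]])
```

Under this all-or-nothing assumption, a model that identifies most but not all correct options still scores zero on that question, which is one way the first failure pattern listed above can arise.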

Paper Structure (48 sections, 6 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Geometry Problem Generation
Existing geometry datasets
The GeoChallenge-90K dataset
Definitions
Clause.
Construction.
Premise.
Problem.
Problem Generation
Large-scale premises sampling
Challenging options generation
Symbol refinement
Manual Verification
...and 33 more sections

Figures (10)

Figure 1: Examples in GeoChallenge-90K dataset.
Figure 2: Pipeline of dataset generation
Figure 3: Geometry elements in GeoChallenge-90K
Figure 4: Error type across different models
Figure 5: The diagram corresponding to the premise above
...and 5 more figures
