GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan

TL;DR

GIQ presents a first-of-its-kind benchmark for evaluating geometric reasoning in vision foundation models using a taxonomy-rich collection of polyhedra, including Platonic, Archimedean, Catalan, and Johnson solids, stellations, and compounds. The dataset combines 224 synthetic and real-world polyhedra with ground-truth geometry and symmetry, enabling assessments across monocular 3D reconstruction, 3D symmetry detection, mental rotation, and zero-shot shape classification. Across experiments, state-of-the-art reconstruction methods fail to capture basic geometric properties, and encoders show some symmetry awareness but struggle with detailed geometric differentiation. Frontier vision-language models exhibit substantial limitations in translating geometric understanding into accurate classifications. The work positions GIQ as a geometric litmus test and a practical platform to guide the development of robust, geometry-aware representations for spatial reasoning in AI systems.

Abstract

Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic Platonic solids accurately. Next, although foundation models may be shown via linear and non-linear probing to capture specific 3D symmetry elements, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants such as ChatGPT, Gemini, and Claude exhibit remarkably low accuracy in interpreting basic shape properties such as face geometry, convexity, and compound structures of complex polyhedra. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Samples of synthetic and real 3D solids from our GIQ dataset. A subset of the 224 real polyhedra included in our dataset, illustrating their variety in complexity, class, and colors. (bottom left) Simulated solids from Mitsuba Physically Based Renderer. (bottom right) Real polyhedra constructed from paper, placed in different realistic backgrounds.
  • Figure 2: Summary of polyhedral groups in GIQ, highlighting group names, counts of distinct 3D shapes (in parentheses), and representative examples. Platonic, Archimedean, and Catalan solids are convex, while Kepler-Poinsot polyhedra and compounds represent special cases of stellations; consequently, the sum of group counts (238) exceeds the 224 unique shapes in the dataset. The categorization presented here is arbitrary: polyhedra possess numerous properties allowing various groupings; we selected this set as a representative example.
  • Figure 3: Left: Balanced accuracy ($0.5 \cdot \frac{\text{TP}}{P} + 0.5 \cdot \frac{\text{TN}}{N}$) for linear probing of 3D symmetry detection using embeddings from different featurizers. The linear classifier is trained only on synthetic images (Syn), and evaluated on real-world (Wild) images for detecting three symmetry types: central point reflection, 5-fold rotation, and 4-fold rotation. Right: Mental Rotation Test accuracy using non-linear probes. Top models (e.g., SigLIP) match the human average ($\sim$69%, green dotted line), though 68% of human participants still outperformed the best model.
  • Figure 4: (a) Zero-shot classification accuracy of various frontier models across polyhedron categories using wild images. Results on synthetic images showed only marginal differences and are provided in the appendix. (b) Qualitative reasoning failures of frontier vision-language models. Correct text in green, incorrect in red.
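
The balanced accuracy in the Figure 3 caption, $0.5 \cdot \frac{\text{TP}}{P} + 0.5 \cdot \frac{\text{TN}}{N}$, averages the true-positive and true-negative rates, so a probe that always predicts the majority class scores 0.5 rather than the majority fraction. The following is a minimal sketch of that formula for binary labels. The function name `balanced_accuracy`, the 0/1 label encoding, and the toy labels are assumptions made for illustration and are not taken from the paper or its code.

```python
def balanced_accuracy(y_true, y_pred):
    """Return 0.5 * TP / P + 0.5 * TN / N for binary labels (1 = positive, 0 = negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # P and N are the counts of ground-truth positives and negatives.
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")

    # TP and TN are the correctly predicted positives and negatives.
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    return 0.5 * true_pos / positives + 0.5 * true_neg / negatives


# Hypothetical toy labels: 2 images whose shape has the symmetry, 8 whose shape does not.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A probe that always answers "no symmetry" is right on 80% of the images,
# yet its balanced accuracy is only chance level.
always_negative = [0] * len(y_true)
plain_accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(plain_accuracy)                              # 0.8
print(balanced_accuracy(y_true, always_negative))  # 0.5
```

Weighting both classes equally is a reasonable choice when a symmetry element is present in only a minority of shapes, since plain accuracy would then reward a probe for ignoring the positive class.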