Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

Chen Yang; Guanxin Lin; Youquan He; Peiyao Chen; Guanghe Liu; Yufan Mo; Zhouyuan Xu; Linhao Wang; Guohui Zhang; Zihang Zhang; Shenxiang Zeng; Chen Wang; Jiansheng Fan

TL;DR

The paper introduces SSI-Bench, a benchmark for constrained-manifold spatial reasoning (CMSR) that evaluates how well vision-language models recover constraint-consistent 3D structure from complex real-world engineering scenes. It formalizes CMSR with a latent state on a feasible manifold and a ranking-based ground-truth criterion, and organizes 1,000 questions into Geometric, Topological, and Multi-View categories. A fully human-centered construction pipeline yields high-quality, unambiguous questions designed to minimize 2D cue leakage. Evaluation across 31 VLMs reveals a large gap to human performance: humans score 91.6%, the strongest closed-source model reaches 33.6%, and the best open-source model reaches 22.2%. Thinking-based prompting provides only modest improvements, and error analysis identifies core bottlenecks in structural grounding and globally consistent 3D reasoning, informing directions for future structure-aware, geometry- and topology-oriented multimodal learning.

Abstract

Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.
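
The abstract describes the questions as ranking questions, and the Figure 2 caption notes that ties use the smaller index first. The sketch below illustrates how such a question could be scored. It is not the benchmark's released evaluation code: the function names, the assumption that each candidate carries a single scalar ground-truth value, and the exact-match scoring rule are all hypothetical choices made for illustration.

```python
from typing import Sequence


def rank_candidates(values: Sequence[float], descending: bool = False) -> list[int]:
    """Order candidate indices by value (hypothetical helper).

    Python's sort is stable, so candidates with equal values keep their
    original order, which places the smaller index first on ties.
    """
    return sorted(
        range(len(values)),
        key=lambda i: -values[i] if descending else values[i],
    )


def exact_match_accuracy(
    predictions: Sequence[Sequence[int]],
    references: Sequence[Sequence[int]],
) -> float:
    """Fraction of questions whose predicted ranking equals the reference.

    Assumption: a question counts as correct only if the whole ranking matches.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical ground-truth values for four candidates; 0 and 2 are tied.
reference = rank_candidates([2.0, 1.0, 2.0, 0.5])
score = exact_match_accuracy([reference, [0, 1, 2]], [reference, [2, 1, 0]])
```

Under exact-match scoring a uniformly random permutation of n distinct candidates is correct with probability 1/n!, so chance level depends on how many candidates each question has.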

Paper Structure (31 sections, 10 equations, 25 figures, 5 tables)

Introduction
Related Work
SSI-Bench
Problem Formulation
Overview of SSI-Bench
Benchmark Construction Progress
Experiments and Analysis
Evaluation Settings
Main Results
Impact of Thinking on CMSR
Error Analysis
Conclusions
Appendix Overview and Organization
Comparison with Other Spatial Intelligence Benchmarks
Dataset Statistics
...and 16 more sections

Figures (25)

Figure 1: SSI-Bench is a diverse, human-annotated, and challenging benchmark, designed to evaluate models’ constrained-manifold spatial reasoning on complex real-world 3D structures. The bar chart on the right illustrates the significant performance gap between state-of-the-art VLMs and human performance on SSI-Bench.
Figure 2: Representative SSI-Bench samples from each category. For visualization, we overlay all candidates in one image; the benchmark provides separately annotated images per option. Ties use the smaller index first. Full questions are given in the appendix.
Figure 3: Illustration of the SSI-Bench construction pipeline.
Figure 4: (Left) Relationship between thinking-token usage and accuracy; (Right) Sub-category level effects of thinking on CMSR.
Figure 5: Illustration of four error types identified in VLM spatial reasoning on SSI-Bench.
...and 20 more figures
