SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Yuyou Zhang, Radu Corcodel, Chiori Hori, Anoop Cherian, Ding Zhao

TL;DR

SpinBench presents a cognitively grounded diagnostic benchmark to scrutinize spatial reasoning in vision-language models, emphasizing perspective taking across single- and multi-object scenes. It decomposes spatial reasoning into seven diagnostic categories with controlled frame-of-reference, premise-based prompts, and symmetry/syntactic variations, using both synthetic Infinigen data and real-world ABO, Cars, and Faces to cover 2,599 samples. Evaluating 37 VLMs reveals systematic weaknesses such as egocentric bias and rotation failures, with scaling exhibiting emergent capabilities and chain-of-thought prompting providing task-dependent gains; human response times correlate with model accuracy, indicating the benchmark captures cognitive difficulty shared across humans and machines. SpinBench delivers actionable diagnostics for embodied AI and spatial understanding in VLMs, highlighting gaps in core spatial competencies and guiding future model improvements.

Abstract

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.

Paper Structure

This paper contains 72 sections, 39 figures, 5 tables.

Figures (39)

  • Figure 1: Overview of SpinBench task design across seven task groups. Representative subtasks are illustrated for each group with simplified question wording for clarity. In the released benchmark, all queries include explicit frame-of-reference definitions to avoid ambiguity. Human face data are sourced from the Stereo Face Database (DOI: 10.1007/11564386_10) and are licensed for research use only.
  • Figure 2: Distribution of SpinBench tasks across seven spatial reasoning categories and four visual domains. Right: Task breakdown by domain.
  • Figure 3: Performance heatmap of 37 VLMs across 23 grouped task variants, organized under 7 spatial reasoning categories. Cohen's kappa values ($\kappa$) measure chance-adjusted performance, where $\kappa=0$ indicates chance-level and $\kappa=1$ perfect accuracy. Three chain-of-thought (CoT) variants of space reasoning models are included for comparison.
  • Figure 4: Strong correlation between spatial reasoning accuracy and consistency across vision-language models. Left: Model rankings by overall accuracy (top) and pair-wise consistency percentage (bottom), with colors indicating consistency levels. Right: Scatter plot revealing robust positive correlation (Pearson $r=0.874, p<0.05$) between the two metrics.
  • Figure 5: (a) Scatter plot comparing Perspective-taking(T) with premise accuracy against overall accuracy for each model, demonstrating that linguistic spatial reasoning failures are correlated with general model competence. Models are color-coded by Perspective-taking(T) with premise accuracy. (b) Scatter plot showing the relationship between VLM accuracy (x-axis) and human response time (y-axis) across 51 task subtypes.
  • ...and 34 more figures
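
The chance-adjusted score described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a multiple-choice task whose chance rate is uniform (1 divided by the number of answer options), so that $\kappa = (p_o - p_e)/(1 - p_e)$ with observed accuracy $p_o$ and chance accuracy $p_e$. The function name `chance_adjusted_kappa` and the uniform-chance assumption are illustrative and are not taken from the paper, whose exact computation may differ.

```python
def chance_adjusted_kappa(accuracy: float, num_choices: int) -> float:
    """Rescale raw accuracy so that chance level maps to 0 and perfect accuracy to 1.

    Assumption (not from the paper): every answer option is equally likely
    under random guessing, so chance accuracy is 1 / num_choices.
    """
    if num_choices < 2:
        raise ValueError("num_choices must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")

    p_chance = 1.0 / num_choices
    # Cohen's kappa form: (observed - expected) / (1 - expected)
    return (accuracy - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # A binary task answered at 75% accuracy sits halfway between chance and perfect.
    print(chance_adjusted_kappa(0.75, 2))
    # Below-chance accuracy yields a negative value.
    print(chance_adjusted_kappa(0.25, 2))
```

Under this rescaling, tasks with different numbers of answer options become comparable on one scale, which is presumably why the heatmap reports $\kappa$ rather than raw accuracy. Negative values indicate below-chance performance, as might arise from a systematic bias.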