Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna

TL;DR

STARE is a multimodal benchmark that evaluates spatial cognition in multimodal large language models through multi-step visual simulations. It spans 4K tasks across 2D/3D transformations, cube net folding, tangram puzzles, and real-world perspective and temporal reasoning, with and without intermediate visual steps. Models perform well on simple 2D tasks but near random chance on complex tasks such as cube net folding and tangram puzzles, which points to a gap in how they use visual simulations. Humans substantially outperform models and answer faster when given intermediate steps. Intermediate visuals have a mixed effect on models: they help on most tasks but hurt in specific cases, depending on the model. The benchmark has implications for robotics, AR/VR, and education.

Abstract

Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9 s) on complex tasks, and they respond significantly faster (by 7.5 seconds on average) when given intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

Paper Structure

This paper contains 60 sections, 23 figures, 14 tables.

Figures (23)

  • Figure 1: Overview of STARE. STARE consists of 3 levels of tasks: 2D Transformation and 3D Transformation for foundational spatial reasoning skills, tangram puzzle and cube net folding for integrated spatial reasoning, and temporal frame inference and perspective reasoning to mimic real-world scenarios. The intermediate steps for completing tasks in the first two levels can be explicitly simulated, while the more real-world spatial reasoning tasks require more abstract and implicit mental simulations.
  • Figure 2: The different variants in the Tangram Puzzle task. We provide visualizations of the complete interleaved inputs for all three types in the appendix.
  • Figure 3: GPT-4o performance on individual 2D/3D transformation types, with and without Visual Simulation (VisSim).
  • Figure 4: A perception error from Claude-3.5 Sonnet. Refer to the appendix for more case studies.
  • Figure 5: GPT-4o performance vs. task complexity (left: difficulty levels; right: number of transformation steps), with and without Visual Simulation (VisSim).
  • ...and 18 more figures