Spatial Mental Modeling from Limited Views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei

TL;DR

MindCube is a challenging benchmark that probes Vision-Language Models on spatial reasoning under partial observations, revealing large gaps relative to human performance. The work systematically investigates three scaffolds (view interpolation, cognitive maps, and reasoning) and shows that explicitly training models to build and reason over internal spatial representations yields the strongest gains. A map-then-reason paradigm, trained with supervised fine-tuning and then reinforcement learning, raises QA accuracy from ~38% to ~71%, demonstrating the potential of internal spatial scaffolding for robust understanding of unseen space. The findings offer practical guidance for designing spatially aware AI systems that reason about hidden or future states in partially observable environments.

Abstract

Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark, with 21,154 questions across 3,268 images, exposes a critical gap: existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models: unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
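To make the "map-then-reason" idea concrete, the following is a minimal sketch of why an explicit map helps with perspective-taking questions. It is not the paper's map format or training procedure: the grid layout, the `HEADINGS` table, the `relation` helper, and the example objects are all hypothetical choices made for illustration. The sketch assumes a top-down map in which each object occupies an (x, y) cell, with x growing east and y growing north.

```python
# Hypothetical illustration only: a top-down "cognitive map" stored as
# object name -> (x, y) grid cell. x grows east, y grows north.

# Unit forward vectors for the four headings a viewer can face.
HEADINGS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}


def relation(cmap, viewer, heading, target):
    """Describe where `target` lies for a viewer standing at the position
    of `viewer` and facing `heading` (left / right / front / behind)."""
    vx, vy = cmap[viewer]
    tx, ty = cmap[target]
    dx, dy = tx - vx, ty - vy          # offset from viewer to target
    fx, fy = HEADINGS[heading]         # forward direction
    rx, ry = fy, -fx                   # forward rotated 90 degrees clockwise

    forward = dx * fx + dy * fy        # projection onto the forward axis
    right = dx * rx + dy * ry          # projection onto the rightward axis

    if forward == 0 and right == 0:
        return "same position"
    # Report the dominant axis; ties go to the left/right axis.
    if abs(right) >= abs(forward):
        return "right" if right > 0 else "left"
    return "front" if forward > 0 else "behind"


# Step 1 ("map"): a map that a model might produce from a few views.
scene = {"chair": (0, 0), "table": (0, 2), "lamp": (2, 0)}

# Step 2 ("reason"): answer perspective questions from the map alone.
print(relation(scene, "chair", "north", "lamp"))   # right
print(relation(scene, "chair", "east", "table"))   # left
print(relation(scene, "table", "north", "chair"))  # behind
```

Once positions are written down explicitly, a change of viewpoint reduces to a rotation of the same coordinates, which is the intuition behind generating the map before reasoning over it.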

Paper Structure

This paper contains 73 sections, 10 equations, 23 figures, 13 tables, 1 algorithm.

Figures (23)

  • Figure 1: Top: VLMs cannot maintain a coherent mental model when evaluated on the $\textsc{MindCube}$ benchmark. Bottom: We study how to help VLMs imagine space through external strategies (scaling of views, cognitive map input) and internal strategies (fine-tuning, cognitive map elicitation). We find that the joint cognitive map and reasoning setting yields the highest gain ($+32.86\%$). Markers in the figure denote the best result within the same elicitation method and the best-performing combination.
  • Figure 1: Examples of camera poses in ARKitScenes
  • Figure 2: $\textsc{MindCube}$ taxonomy and examples. Left: Three camera movement patterns (Rotation, Around, Among) with corresponding spatial QA examples. Right: Four-dimensional taxonomy categorizing $\textsc{MindCube}$ question types.
  • Figure 2: $\textsc{MindCube}$ Bench construction pipeline.
  • Figure 3: Grounded examples of our three data structures that approximate spatial mental models.
  • ...and 18 more figures