Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li

TL;DR

Problem: foundation models struggle with active spatial exploration under partial observability. Approach: Theory of Space introduces a belief-centric framework with Construct, Revise, and Exploit operations, plus explicit cognitive-map probing, in a multimodal benchmark for studying active exploration. Contributions: (i) a task-agnostic active exploration paradigm, (ii) a spatial-environment benchmark with Route and Survey tasks, and (iii) diagnostics that expose active-exploration bottlenecks, perception limits, belief instability, and belief inertia, together with a false-belief revision test. Impact: provides a principled platform for developing uncertainty-aware exploration and robust spatial-belief maintenance for embodied AI in real-world, multimodal settings.

Abstract

Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.

Paper Structure (28 sections, 26 figures, 11 tables)

Figures (26)

  • Figure 1: Theory of Space: active exploration, probed belief, and evaluation. Left: a top-down view of agent trajectory under partial observability in multiple-room scenes. Middle: the agent’s action loop of moving, rotating, and observing in text- or vision-based environments, receiving egocentric observations and updating an internal belief. Right: evaluation through exploitation of the belief in spatial tasks and direct probing via probed cognitive maps.
  • Figure 2: Evaluation accuracy vs. exploration cost for active exploration in vision-world. Faded icons mark the passive setting, where the agent gets a pre-generated exploration history and only reasons.
  • Figure 3: Theory of Space exploitation task suite: it covers route-level egocentric reasoning and survey-level allocentric mapping. Route tasks evaluate path-based inference and egocentric observations. Survey tasks test global mapping, geometric transformation, and perspective conversion. Together they cover both local navigation reasoning and global spatial abstraction.
  • Figure 4: Accumulated information gain over exploration steps in the text world.
  • Figure 5: Internal Spatial Belief Probing. At each step, the agent executes an action, receives an observation, and updates its spatial belief. We probe this belief by prompting the agent to (i) output a JSON-structured cognitive map of all observed objects and (ii) select the next unexplored position from a top-down view given a set of labeled candidate points. For clarity, the figure shows the probing process for a single step.
  • ...and 21 more figures
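
The Figure 5 caption describes probing the agent's belief as a JSON-structured cognitive map of all observed objects. As a rough illustration of how such a probe output could be parsed and compared with ground truth, here is a minimal sketch. The JSON schema (`objects` entries with `name`, `x`, `y`), the function names, and the two scores (coverage and mean position error) are assumptions made for this example; they are not the benchmark's actual format or metrics.

```python
import json
import math


def parse_cognitive_map(raw: str) -> dict[str, tuple[float, float]]:
    """Parse a probed map. The schema is hypothetical:
    {"objects": [{"name": str, "x": number, "y": number}, ...]}"""
    data = json.loads(raw)
    return {o["name"]: (float(o["x"]), float(o["y"])) for o in data["objects"]}


def map_error(
    pred: dict[str, tuple[float, float]],
    truth: dict[str, tuple[float, float]],
) -> tuple[float, float]:
    """Return (coverage, mean position error).

    coverage: fraction of ground-truth objects present in the probed map.
    mean position error: average Euclidean distance over the objects found
    (infinity if none were found). Both are illustrative scores only.
    """
    found = [name for name in truth if name in pred]
    coverage = len(found) / len(truth) if truth else 0.0
    if not found:
        return coverage, math.inf
    err = sum(math.dist(pred[n], truth[n]) for n in found) / len(found)
    return coverage, err


# Hypothetical probe output from a model after a few exploration steps.
probe_output = (
    '{"objects": [{"name": "chair", "x": 1, "y": 2},'
    ' {"name": "lamp", "x": 4, "y": 6}]}'
)
# Hypothetical ground-truth layout; the sofa has not been observed yet.
ground_truth = {"chair": (1.0, 2.0), "lamp": (4.0, 5.0), "sofa": (0.0, 3.0)}

coverage, err = map_error(parse_cognitive_map(probe_output), ground_truth)
print(f"coverage={coverage:.2f}, mean_error={err:.2f}")
```

Running the probe after every step and tracking these two numbers over time is one simple way to picture the belief instability described in the abstract: a score that improves and then worsens would indicate that previously correct spatial knowledge has degraded.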