SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang

TL;DR

MLLMs struggle with complex spatial reasoning that requires mental simulation. The paper introduces SpatialDreamer, an RL framework that couples active exploration with world-model-based visual imagination and evidence-grounded reasoning. It is trained with GeoPO, a tree-structured, step-aware policy optimization method with geometric penalties. SpatialDreamer achieves state-of-the-art results on SAT, MindCube, and VSI-Bench, with faster convergence and robust performance. The paper also introduces SpatialDreamer-SFT, a dataset crafted to train and evaluate agentic imagination with both single-pass and reflective reasoning samples. By integrating perception, imagination, and reasoning into a closed-loop policy, the work advances intrinsic spatial imagery in MLLMs and moves toward human-like spatial mental simulation in multimodal reasoning systems.

Abstract

Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.

Paper Structure

This paper contains 16 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Conceptual comparison between (left) vanilla GRPO and (right) our GeoPO. GRPO samples multiple independent trajectories and relies solely on episode-level rewards. In contrast, GeoPO achieves step-level reward guidance through a tree-structured sampling scheme with geometric conflict and redundancy detection (identical or opposing actions between adjacent steps). $\lambda$ is the penalty coefficient. $L$ and $R$ denote the actions of turning left and right, respectively.
  • Figure 2: (a) An overview of SpatialDreamer. In each round, SpatialDreamer thinks about the geometric context and imagines novel ego-centric views by invoking a world model using the rollout parameters (e.g., left-27m), and finally answers by integrating all the accumulated evidence. (b) The architecture of GeoPO. Starting from the question, at most $N$ trajectories are generated in each step until the answer is generated or the maximum depth limit $T_\text{max}$ is reached. The reward for a leaf node is computed based on the ground-truth answer, while the reward for any intermediate node is defined as the average of the rewards of all its direct child nodes. Additionally, a geometric penalty coefficient (i.e., 0.9) is imposed on sub-optimal rollouts including redundant or conflicting actions. "L/0.4" denotes turning left by 0.4 m, and other symbols follow the same convention. The values on the left of each node indicate the step-wise rewards.
  • Figure 3: The construction process of the SpatialDreamer-SFT dataset, which includes single-pass and reflective reasoning samples. Refer to the supplementary materials for more details.
  • Figure 4: Efficiency analysis between GRPO and GeoPO. (a) Reward curve along training steps. (b) Per-trajectory generation time (s): the average time required to generate one trajectory. (c) Per-token generation time (ms): the average time required to generate one token.
  • Figure 5: Comparison of response length during training. GeoPO maintains stable and informative responses, while GRPO collapses to short outputs.
  • ...and 5 more figures
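
The step-level reward scheme described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The `Node` class, the `OPPOSITE` action table, and the function names are hypothetical. The sketch also assumes that the penalty is applied multiplicatively to a node whose action is identical or opposite to its parent's action, and that intermediate nodes average the already-penalized rewards of their children. The captions state only the averaging rule, the detection criterion, and the coefficient 0.9.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PENALTY = 0.9  # geometric penalty coefficient quoted in the Figure 2 caption

# Hypothetical action vocabulary: each direction maps to its opposite.
OPPOSITE = {"L": "R", "R": "L", "F": "B", "B": "F"}


@dataclass
class Node:
    """One step in the sampled tree (illustrative structure only)."""
    action: Optional[str] = None   # e.g. "L" or "R"; None for the root (the question)
    reward: float = 0.0            # answer-based reward, used only for leaves
    children: List["Node"] = field(default_factory=list)


def is_suboptimal(prev: Optional[str], curr: Optional[str]) -> bool:
    """Flag redundant (identical) or conflicting (opposing) adjacent actions."""
    if prev is None or curr is None:
        return False
    return curr == prev or OPPOSITE.get(prev) == curr


def step_reward(node: Node, parent_action: Optional[str] = None) -> float:
    """Leaf: answer-based reward. Intermediate node: mean of its children's rewards.

    Assumption: the penalty multiplies the reward of a flagged node.
    """
    if node.children:
        child_rewards = [step_reward(c, node.action) for c in node.children]
        value = sum(child_rewards) / len(child_rewards)
    else:
        value = node.reward
    if is_suboptimal(parent_action, node.action):
        value *= PENALTY
    return value
```

Calling `step_reward(root)` on a tree built from `Node` objects returns the root's step-level estimate, and calling it on any subtree together with the parent's action returns that node's value. Under these assumptions, a branch that turns left twice in a row or turns left and then right scores lower than an otherwise identical branch without the redundant or conflicting step.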