SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
TL;DR
The paper addresses the difficulty MLLMs have with complex spatial reasoning that requires mental simulation. It introduces SpatialDreamer, an RL framework that couples active exploration with world-model-based visual imagination and evidence-grounded reasoning, trained with GeoPO, a tree-structured, step-aware policy optimization method with geometric penalties. SpatialDreamer achieves state-of-the-art results on SAT, MindCube, and VSI-Bench, with faster convergence and robust performance. The paper also introduces SpatialDreamer-SFT, a dataset built to train and evaluate agentic imagination in both single-pass and reflective reasoning settings. By integrating perception, imagination, and reasoning into a closed-loop policy, the work moves MLLMs toward human-like spatial mental simulation and provides benchmarks and a specialized dataset for future research.
Abstract
Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
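To make the idea of tree-structured sampling with step-level reward estimation more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a step's value is the mean terminal reward of the leaves beneath it, that a step's advantage is its value minus the mean value of its siblings, and that a geometric-consistency violation is expressed as a scalar penalty subtracted from that advantage. The names `Node`, `geo_penalty`, `leaf_rewards`, `value`, and `step_advantages` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One step in a hypothetical rollout tree (not the paper's data structure)."""
    geo_penalty: float = 0.0          # assumed scalar penalty for a geometrically inconsistent step
    reward: Optional[float] = None    # terminal reward; only meaningful on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the terminal rewards of all leaves under this node."""
    if not node.children:
        return [float(node.reward or 0.0)]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Assumed step-level value: mean terminal reward over the node's leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def step_advantages(parent: Node) -> List[float]:
    """Advantage per child: value minus the sibling mean, minus its geometric penalty."""
    values = [value(child) for child in parent.children]
    baseline = sum(values) / len(values)
    return [v - baseline - c.geo_penalty for v, c in zip(values, parent.children)]


# Two sibling steps branching from the same state, each expanded into two rollouts.
step_a = Node(children=[Node(reward=1.0), Node(reward=1.0)])
step_b = Node(geo_penalty=0.25, children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[step_a, step_b])

advantages = step_advantages(root)
print(advantages)  # [0.25, -0.5]
```

In this toy tree, the step whose rollouts all succeed receives a positive advantage, while the sibling with a mixed outcome and a geometric penalty is pushed down further than its outcome alone would imply. This is the kind of step-level signal that an outcome-only reward on a long-horizon trajectory would not provide.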
