Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning

Yijun Liu; Yuwei Liu; Yuan Meng; Jieheng Zhang; Yuwei Zhou; Ye Li; Jiacheng Jiang; Kangye Ji; Shijia Ge; Zhi Wang; Wenwu Zhu

TL;DR

SP addresses the lack of spatial awareness in visuomotor robotic manipulation with a spatially grounded framework that couples explicit spatial planning with video imagination and action execution. A Spatial Plan Table conditions a diffusion-based video generator, a flow-based diffusion policy then predicts actions, and a Spatial Reasoning Feedback Policy performs dual-stage replanning guided by vision-language feedback. The approach improves task success on the Meta-World and iTHOR benchmarks and demonstrates practical viability in real-world robot experiments, highlighting the importance of structured spatial reasoning for robust long-horizon control. Together, spatially conditioned video synthesis, flow-aware action planning, and closed-loop spatial refinement offer a principled path toward reliable, spatially consistent embodied manipulation in diverse environments. The framework's modular design and emphasis on explicit spatial geometry suggest potential for generalization to other embodied tasks and to real-world deployment, especially where depth and spatial layout vary.

Abstract

Vision-centric hierarchical embodied models have demonstrated strong potential. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through the spatial plan table. Then, we propose a flow-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP substantially outperforms state-of-the-art baselines, achieving over 33% improvement on Meta-World and over 25% improvement on iTHOR, demonstrating strong effectiveness across 23 embodied control tasks. We additionally evaluate SP in real-world robotic experiments to verify its practical viability. SP enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
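
To make the control flow described above concrete, the following is a minimal toy sketch of a plan, imagine, act, and replan loop. It is not the authors' implementation: every name here (`SpatialPlanTable` as a list of waypoints, `make_plan`, `imagine`, `predict_actions`, `run_episode`) is a hypothetical stand-in, the "video" is reduced to a list of predicted positions, and the feedback step is a simple distance check rather than the vision-language, dual-stage replanning used in SP.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec = Tuple[float, ...]


@dataclass
class SpatialPlanTable:
    """Hypothetical stand-in: an ordered list of spatial sub-goals."""
    waypoints: List[Vec]


def make_plan(state: Vec, goal: Vec, steps: int = 4) -> SpatialPlanTable:
    """Toy planner (assumption): linear interpolation from state to goal."""
    waypoints = [
        tuple(s + (g - s) * i / steps for s, g in zip(state, goal))
        for i in range(1, steps + 1)
    ]
    return SpatialPlanTable(waypoints)


def imagine(table: SpatialPlanTable) -> List[Vec]:
    """Stand-in for plan-conditioned video generation: one 'frame' per waypoint."""
    return list(table.waypoints)


def predict_actions(state: Vec, frames: List[Vec]) -> List[Vec]:
    """Stand-in for action prediction: displacement between consecutive frames."""
    actions, prev = [], state
    for frame in frames:
        actions.append(tuple(b - a for a, b in zip(prev, frame)))
        prev = frame
    return actions


def run_episode(
    state: Vec,
    goal: Vec,
    execute: Callable[[Vec, Vec], Vec],
    tol: float = 0.1,
    max_replans: int = 3,
) -> Tuple[Vec, bool, int]:
    """Closed loop: plan, imagine, act, then replan from the observed state."""
    for replans in range(max_replans + 1):
        table = make_plan(state, goal)
        for action in predict_actions(state, imagine(table)):
            state = execute(state, action)
        # Toy feedback check (assumption): success if every axis is within tol.
        if max(abs(a - b) for a, b in zip(state, goal)) <= tol:
            return state, True, replans
    return state, False, max_replans


def perfect(state: Vec, action: Vec) -> Vec:
    """Executor that applies each action exactly."""
    return tuple(s + a for s, a in zip(state, action))


def sluggish(state: Vec, action: Vec) -> Vec:
    """Executor that only achieves half of each commanded displacement."""
    return tuple(s + 0.5 * a for s, a in zip(state, action))
```

With the `perfect` executor the loop reaches the goal on the first plan. With the `sluggish` executor, open-loop execution stops halfway, and it is the replanning from the observed state that closes the remaining gap, which is the role the feedback policy plays in the description above.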
