Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning

Yijun Liu, Yuwei Liu, Yuan Meng, Jieheng Zhang, Yuwei Zhou, Ye Li, Jiacheng Jiang, Kangye Ji, Shijia Ge, Zhi Wang, Wenwu Zhu

TL;DR

SP addresses the lack of spatial awareness in visuomotor robotic manipulation by introducing a spatially grounded framework that couples explicit spatial planning with video imagination and action execution. It uses a Spatial Plan Table to condition a diffusion-based video generator, followed by a flow-based diffusion policy for action prediction, and a Spatial Reasoning Feedback Policy that performs dual-stage replanning guided by vision-language feedback. The approach improves task success on the Meta-World and iTHOR benchmarks and demonstrates practical viability in real-world robot experiments, highlighting the importance of structured spatial reasoning for robust long-horizon control. The combination of spatially conditioned video synthesis, flow-aware action planning, and closed-loop spatial refinement offers a principled path toward reliable, spatially consistent embodied manipulation in diverse environments. The framework’s modular design and emphasis on explicit spatial geometry suggest potential for generalization to other embodied tasks and real-world deployment, especially where depth and spatial layouts vary.

Abstract

Vision-centric hierarchical embodied models have demonstrated strong potential. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through the spatial plan table. Then, we propose a flow-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP substantially outperforms state-of-the-art baselines, achieving over 33% improvement on Meta-World and over 25% improvement on iTHOR, demonstrating strong effectiveness across 23 embodied control tasks. We additionally evaluate SP in real-world robotic experiments to verify its practical viability. SP enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
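
The three stages named in the abstract (spatial plan table, video-conditioned action prediction, feedback-driven replanning) can be illustrated with a toy closed loop. The sketch below is not the authors' implementation: every function name is hypothetical, the "VLM", "video generator", and "flow-based predictor" are replaced by simple geometric stand-ins, and the environment is a 3D point that realizes only a fraction (`gain`) of each commanded displacement. It is meant only to show why refining the plan table from feedback helps when open-loop execution drifts.

```python
from typing import List, Tuple

Vec = Tuple[float, float, float]

def sub(a: Vec, b: Vec) -> Vec:
    return tuple(x - y for x, y in zip(a, b))

def add(a: Vec, b: Vec) -> Vec:
    return tuple(x + y for x, y in zip(a, b))

def norm(a: Vec) -> float:
    return sum(x * x for x in a) ** 0.5

def make_plan_table(robot: Vec, target: Vec) -> List[Vec]:
    """Stand-in for VLM spatial reasoning (assumption): one axis-aligned
    sub-goal per non-zero axis of the robot-to-object offset."""
    table = []
    for axis, d in enumerate(sub(target, robot)):
        if abs(d) > 1e-9:
            step = [0.0, 0.0, 0.0]
            step[axis] = d
            table.append(tuple(step))
    return table

def imagine_waypoints(robot: Vec, table: List[Vec], frames: int = 4) -> List[Vec]:
    """Stand-in for spatially conditioned video generation (assumption):
    linearly interpolated waypoints along each sub-goal."""
    waypoints, pos = [robot], robot
    for step in table:
        for k in range(1, frames + 1):
            waypoints.append(add(pos, tuple(s * k / frames for s in step)))
        pos = add(pos, step)
    return waypoints

def predict_actions(waypoints: List[Vec]) -> List[Vec]:
    """Stand-in for flow-based action prediction (assumption):
    frame-to-frame displacement between consecutive waypoints."""
    return [sub(b, a) for a, b in zip(waypoints, waypoints[1:])]

def execute(robot: Vec, actions: List[Vec], gain: float = 0.8) -> Vec:
    """Toy environment: only `gain` of each commanded displacement is realized."""
    for a in actions:
        robot = add(robot, tuple(gain * x for x in a))
    return robot

def run_closed_loop(robot: Vec, target: Vec, tol: float = 0.01,
                    max_rounds: int = 10) -> Tuple[Vec, int]:
    """Replan from the observed residual offset until it is within `tol`.
    A plain distance check replaces the paper's dual-stage feedback."""
    rounds = 0
    while norm(sub(target, robot)) > tol and rounds < max_rounds:
        table = make_plan_table(robot, target)
        actions = predict_actions(imagine_waypoints(robot, table))
        robot = execute(robot, actions)
        rounds += 1
    return robot, rounds
```

In this toy setting a single open-loop pass leaves a residual of roughly `1 - gain` times the initial offset, while each replanning round shrinks the remaining offset by the same factor, so a handful of rounds is enough to reach the tolerance.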

Paper Structure

This paper contains 40 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Framework Overview of Spatial Policy. Our system comprises three modules: (1) Spatial-Conditioned Embodied Video Generation, which uses a Spatial Plan Table, obtained via VLM reasoning over the spatial offset between the robot and the object, to guide a diffusion model in generating spatially coherent video predictions. (2) Flow-Based Action Prediction, which converts the generated video into executable actions using flow and spatial coordinates. (3) Spatial Reasoning Feedback Policy, which enables real-time correction via a dual-stage replanning strategy, combining VLM-based video judgement and policy diagnostics to refine the spatial plan table for closed-loop control.
  • Figure 2: Comparisons on Shelf Place and Basketball in Meta-World. The yellow line shows the motion trajectory; ★ marks the start, ★ the final position, and ● intermediate steps.
  • Figure 3: Comparisons on Painting and Mirror in iTHOR. The bounding box marks the target object.
  • Figure 4: Example of iterative spatial replanning during manipulation in Meta-World.
  • Figure 5: Real-world execution of the pick doll task.
  • ...and 5 more figures