Motion Before Action: Diffusing Object Motion as Manipulation Condition
Yue Su, Xinyu Zhan, Hongjie Fang, Yong-Lu Li, Cewu Lu, Lixin Yang
TL;DR
MBA introduces a two-stage diffusion framework that first generates future object motions from observations and then conditions robot action generation on these motions. By factorizing $p(\boldsymbol{M},\boldsymbol{A}|\boldsymbol{O})$ into $p(\boldsymbol{M}|\boldsymbol{O})p(\boldsymbol{A}|\boldsymbol{M},\boldsymbol{O})$, MBA provides human-like reasoning for manipulation and improves kinematic consistency. Across 57 simulated tasks and four real-world tasks, MBA consistently boosts performance of diffusion-head policies, enhances learning efficiency, and demonstrates robustness to varied object types and tasks. The approach offers a practical, plug-and-play enhancement for existing robotic manipulation systems, with potential for broader adoption and future expansion to deformable objects and long-horizon planning.
Abstract
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/
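The cascade described above (sample object motion from the observation, then sample actions conditioned on both the observation and that motion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy noise predictor `toy_eps_model`, the linear noise schedule, and the dimensions (a 7-D pose and a 7-D action over an 8-step horizon) are all assumptions made for illustration, and a standard DDPM ancestral sampler stands in for whatever sampler the paper actually uses.

```python
import numpy as np


def make_schedule(num_steps):
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(x, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a neural network taking the noisy sample, the
    timestep, and the conditioning features. This placeholder only makes
    the sketch runnable and ignores t.
    """
    return 0.1 * (x - cond.mean())


def ddpm_sample(eps_model, cond, shape, num_steps, rng):
    """Standard DDPM ancestral sampling, conditioned on `cond`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t, cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def sample_motion_then_action(obs, horizon=8, motion_dim=7, action_dim=7,
                              num_steps=20, seed=0):
    """Two cascaded samplers following p(M|O) then p(A|M,O)."""
    rng = np.random.default_rng(seed)
    # Stage 1: future object pose sequence M ~ p(M | O).
    motion = ddpm_sample(toy_eps_model, obs, (horizon, motion_dim), num_steps, rng)
    # Stage 2: action sequence A ~ p(A | M, O), conditioned on obs and motion.
    action_cond = np.concatenate([obs, motion.ravel()])
    action = ddpm_sample(toy_eps_model, action_cond, (horizon, action_dim), num_steps, rng)
    return motion, action


if __name__ == "__main__":
    obs_features = np.zeros(16)  # hypothetical encoded observation
    motion, action = sample_motion_then_action(obs_features)
    print(motion.shape, action.shape)
```

The point of the sketch is the data flow: the second sampler never sees the observation alone, only the observation concatenated with the motion produced by the first sampler, which mirrors the factorization $p(\boldsymbol{M}|\boldsymbol{O})p(\boldsymbol{A}|\boldsymbol{M},\boldsymbol{O})$.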
