Motion Before Action: Diffusing Object Motion as Manipulation Condition

Yue Su, Xinyu Zhan, Hongjie Fang, Yong-Lu Li, Cewu Lu, Lixin Yang

TL;DR

MBA introduces a two-stage diffusion framework that first generates future object motions from observations and then conditions robot action generation on these motions. By factorizing $p(\boldsymbol{M},\boldsymbol{A}\mid\boldsymbol{O})$ into $p(\boldsymbol{M}\mid\boldsymbol{O})\,p(\boldsymbol{A}\mid\boldsymbol{M},\boldsymbol{O})$, MBA provides human-like reasoning for manipulation and improves kinematic consistency. Across 57 simulated tasks and four real-world tasks, MBA consistently boosts performance of diffusion-head policies, enhances learning efficiency, and demonstrates robustness to varied object types and tasks. The approach offers a practical, plug-and-play enhancement for existing robotic manipulation systems, with potential for broader adoption and future expansion to deformable objects and long-horizon planning.
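
The factorization mentioned in the TL;DR is the chain rule of probability applied to the joint distribution over object motion $\boldsymbol{M}$ and robot action $\boldsymbol{A}$ given observation $\boldsymbol{O}$. Written out, together with the marginal that a direct observation-to-action policy would model (the integral form is a standard identity added here for illustration, not a formula quoted from the paper):

$$
p(\boldsymbol{M},\boldsymbol{A}\mid\boldsymbol{O}) = p(\boldsymbol{M}\mid\boldsymbol{O})\,p(\boldsymbol{A}\mid\boldsymbol{M},\boldsymbol{O}),
\qquad
p(\boldsymbol{A}\mid\boldsymbol{O}) = \int p(\boldsymbol{A}\mid\boldsymbol{M},\boldsymbol{O})\,p(\boldsymbol{M}\mid\boldsymbol{O})\,\mathrm{d}\boldsymbol{M}.
$$

Sampling $\boldsymbol{M}\sim p(\boldsymbol{M}\mid\boldsymbol{O})$ first and then $\boldsymbol{A}\sim p(\boldsymbol{A}\mid\boldsymbol{M},\boldsymbol{O})$ therefore yields a sample from the joint, and the action alone is a sample from the marginal $p(\boldsymbol{A}\mid\boldsymbol{O})$.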

Abstract

Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/
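
To make the "motion first, then action" cascade concrete, here is a minimal NumPy sketch of two chained denoising-diffusion samplers. It is an illustration under stated assumptions, not the authors' implementation: `toy_eps_model` is a hypothetical, untrained stand-in for the learned noise-prediction networks, the sampler is plain DDPM ancestral sampling, and the horizon, pose and action dimensions, feature size, and noise schedule are arbitrary choices that do not come from the paper.

```python
import numpy as np

def ddpm_sample(eps_model, cond, shape, betas, rng):
    """Ancestral DDPM sampling of x_0 given a conditioning vector."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step (t == 0).
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def toy_eps_model(x, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return 0.1 * x + 0.01 * np.tanh(cond.mean())

def sample_motion_then_action(obs_feat, rng, motion_shape=(8, 7),
                              action_shape=(8, 7), n_steps=50):
    """Two cascaded samplers: M ~ p(M | O), then A ~ p(A | M, O)."""
    betas = np.linspace(1e-4, 0.02, n_steps)  # assumed linear noise schedule
    # Stage 1: sample a future object pose sequence from the observation features.
    motion = ddpm_sample(toy_eps_model, obs_feat, motion_shape, betas, rng)
    # Stage 2: condition action sampling on both the observation and the motion.
    cond = np.concatenate([obs_feat, motion.ravel()])
    action = ddpm_sample(toy_eps_model, cond, action_shape, betas, rng)
    return motion, action

rng = np.random.default_rng(0)
obs_feat = rng.standard_normal(32)  # placeholder for encoded observations
motion, action = sample_motion_then_action(obs_feat, rng)
print(motion.shape, action.shape)
```

The only structural point the sketch is meant to carry is that the second sampler's condition includes the first sampler's output. In a real system each stage would use its own trained network and an observation encoder in place of the placeholders above.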

Paper Structure

This paper contains 21 sections, 5 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Understanding object motion before action leads to better manipulation. Unlike existing methods that predict actions directly from observations, our approach first infers future object motions, enabling more accurate and goal-driven action prediction.
  • Figure 2: Overview of the MBA pipeline. MBA takes the current observation as input, which can be 3D point clouds or RGB images from different viewpoints. Object pose sequences, treated as object actions, are sampled from the object policy via denoising diffusion and form part of the framework's output. Conditioned on the observations and these object pose sequences, MBA then samples deployable robot actions from the robot policy via denoising diffusion. These actions are executed within the workspace to update the environment state and the observations.
  • Figure 3: Average learning curves (success rate vs. training steps) over three runs, comparing MBA-augmented and baseline policies.
  • Figure 4: Real-world deployment platform and execution process of four manipulation tasks.