
Embodiment-Agnostic Action Planning via Object-Part Scene Flow

Weiliang Tang, Jia-Hui Pan, Wei Zhan, Jianshu Zhou, Huaxiu Yao, Yun-Hui Liu, Masayoshi Tomizuka, Mingyu Ding, Chi-Wing Fu

TL;DR

This work generates 3D object-part scene flow and extracts its transformations to solve action trajectories for diverse embodiments. Because the robot action is derived explicitly from the predicted object motion, the resulting policy is more robust.

Abstract

Observing that the key to robotic action planning is understanding how the target object moves when its associated part is manipulated by the end effector, we propose to generate the 3D object-part scene flow and extract its transformations to solve the action trajectories for diverse embodiments. The advantage of our approach is that it derives the robot action explicitly from object motion prediction, yielding a more robust policy by understanding the object motions. Also, unlike policies trained on embodiment-centric data, our method is embodiment-agnostic, generalizable across diverse embodiments, and able to learn from human demonstrations. Our method comprises three components: an object-part predictor to locate the part for the end effector to manipulate, an RGBD video generator to predict future RGBD videos, and a trajectory planner to extract embodiment-agnostic transformation sequences and solve the trajectory for diverse embodiments. Trained on videos even without trajectory data, our method still outperforms existing works significantly, by 27.7% and 26.2% on the prevailing virtual environments MetaWorld and Franka-Kitchen, respectively. Furthermore, we conducted real-world experiments, showing that our policy, trained only with human demonstrations, can be deployed to various embodiments.
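
To make the idea of extracting transformations from scene flow concrete, the following is a minimal sketch and not the authors' implementation. It assumes the object part moves rigidly and that the flow for each future frame is given as a per-point 3D displacement relative to the first frame; under those assumptions, a least-squares rigid transform can be recovered with the standard Kabsch (SVD-based) algorithm. The function names `estimate_rigid_transform`, `transforms_from_flow`, and `apply_to_grasp_pose` are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Standard Kabsch algorithm; assumes src and dst are (N, 3) arrays of
    corresponding points on a rigidly moving object part.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the SVD solution.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def transforms_from_flow(part_points, flows):
    """One (R, t) per future frame.

    Assumption: flows has shape (T, N, 3) and flows[k] is the displacement
    of each part point from the first frame to frame k.
    """
    part_points = np.asarray(part_points, dtype=float)
    return [estimate_rigid_transform(part_points, part_points + f) for f in flows]


def apply_to_grasp_pose(grasp_pose, R, t):
    """Move a 4x4 grasp pose along with the object part.

    Assumption: the end effector stays rigidly attached to the part, so the
    same world-frame transform applies to the gripper pose.
    """
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M @ grasp_pose
```

Because the recovered transforms describe the object part rather than any particular robot, the resulting pose sequence could in principle be passed to the inverse-kinematics or motion-planning stack of whichever embodiment executes the task.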

Paper Structure (15 sections, 2 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Illustration of our embodiment-agnostic action planning method. Beyond existing approaches, our method learns to generate object-part scene flow (in red), independent of any specific embodiment, thereby enabling it to handle diverse embodiments and produce the execution trajectory.
  • Figure 2: Framework overview. First, we identify the target object part and generate a future video to produce its hallucinated scene flow. Then, we predict the initial grasp pose and use a transformation solver on the scene flow for robotic trajectories.
  • Figure 3: Object parts described by language, illustrated with masks (in red).
  • Figure 4: Illustration of our video generation network. The video generation network has a U-Net-like architecture and follows a denoising diffusion scheme to generate the RGBD sequence of the future frames.
  • Figure 5: Success rates (%, vertical axis) with respect to the number of training demos (horizontal axis) in Meta-World.
  • ...and 4 more figures