Generalizable Humanoid Manipulation with 3D Diffusion Policies

Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, Jiajun Wu

TL;DR

This work tackles the challenge of enabling full-sized humanoid robots to autonomously manipulate across diverse, unseen real-world scenes using data collected from a single scene. It introduces a real-world platform with a 25-DoF upper body mounted on a height-adjustable cart, a whole-upper-body teleoperation setup, and an improved egocentric 3D diffusion policy (iDP3) trained on real human demonstrations. Key innovations include egocentric 3D representations, scaled 3D vision inputs, a pyramid visual encoder, and a longer prediction horizon, all enabling robust zero-shot generalization and onboard real-time control. Across 2000+ real-world trials, the approach demonstrates generalization to kitchens, offices, and other unseen settings, highlighting the practicality and potential of 3D diffusion-based imitation learning for humanoid manipulation.

Abstract

Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills and the high cost of in-the-wild humanoid robot data. In this work, we build a real-world robotic system to address this challenging problem. Our system integrates 1) a whole-upper-body robotic teleoperation system to acquire human-like robot data, 2) a 25-DoF humanoid robot platform with a height-adjustable cart and a 3D LiDAR sensor, and 3) an improved 3D Diffusion Policy learning algorithm for humanoid robots to learn from noisy human data. We run more than 2000 episodes of policy rollouts on the real robot for rigorous policy evaluation. Empowered by this system, we show that using only data collected in a single scene and with only onboard computing, a full-sized humanoid robot can autonomously perform skills in diverse real-world scenarios. Videos are available at https://humanoid-manipulation.github.io.

Paper Structure

This paper contains 13 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Overview of our system. Our system mainly consists of four parts: the humanoid robot platform, the data collection system, the visuomotor policy learning method, and the real-world deployment. With this system, our humanoid robot performs autonomous skills in diverse real-world scenes.
  • Figure 2: iDP3 uses 3D representations in the camera frame, whereas other recent 3D policies, including DP3 (Ze et al., 2024), use 3D representations in the world frame, which requires accurate camera calibration and cannot be extended to mobile robots.
  • Figure 3: Visualization of egocentric 2D and 3D observations. This figure highlights the complexity of diverse real-world scenes. Videos are available at https://humanoid-manipulation.github.io.
  • Figure 4: Trajectories of our three tasks in the training scene, including Pick&Place, Pour, and Wipe. We carefully select daily tasks so that the objects are common in daily scenes and the skills are useful across scenes.
  • Figure 5: Failure cases of image-based methods in new scenes. Here DP corresponds to DP (✶R3M) in the baseline comparison table, which is the strongest image-based baseline we have. We find that even with color augmentation added during training, image-based methods still struggle with new scenes and objects.
  • ...and 3 more figures
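
The camera-frame versus world-frame distinction in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pinhole intrinsics (fx, fy, cx, cy), and the extrinsics (rotation, translation) are hypothetical and chosen only for illustration. The sketch shows that an egocentric point cloud needs only the depth image and camera intrinsics, while a world-frame point cloud additionally needs calibrated extrinsics, which change whenever the camera moves.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_camera_points(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point]:
    """Back-project a depth image (meters) into camera-frame 3D points.

    Uses the standard pinhole model. Only intrinsics are needed, so the
    result is egocentric and independent of where the robot is standing.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column, z: depth
            if z <= 0.0:                   # skip invalid depth readings
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def camera_to_world(
    points: Sequence[Point],
    rotation: Sequence[Sequence[float]],   # hypothetical 3x3 extrinsic rotation
    translation: Sequence[float],          # hypothetical extrinsic translation
) -> List[Point]:
    """Map camera-frame points to the world frame: p_world = R @ p_cam + t.

    This extra step is what requires camera calibration, and R and t must
    be re-estimated whenever the camera pose changes.
    """
    world: List[Point] = []
    for p in points:
        world.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return world


# Toy 2x2 depth image with one invalid pixel and trivial intrinsics.
depth_image = [[1.0, 0.0],
               [2.0, 2.0]]
cam_points = depth_to_camera_points(depth_image, fx=1.0, fy=1.0, cx=0.0, cy=0.0)

# World-frame version under an assumed identity rotation and a 1 m x-offset.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = camera_to_world(cam_points, identity, [1.0, 0.0, 0.0])
```

In this sketch a policy consuming `cam_points` never touches the extrinsics, which is the property the caption attributes to camera-frame representations.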