Humanoid Policy ~ Human Policy

Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, Xiaolong Wang

TL;DR

Humanoid Policy ~ Human Policy tackles the costly bottleneck of collecting robot demonstrations by leveraging egocentric human data. It introduces PH^2D, a task-oriented egocentric dataset with accurate 3D hand/finger poses, and HAT, a transformer-based policy that unifies human and humanoid state-action spaces and retargets actions differentiably. Empirical results show that co-training with human data substantially boosts out-of-distribution generalization and data efficiency, enabling robust cross-embodiment manipulation across different humanoids. This work demonstrates a scalable path to open-ended humanoid manipulation by treating humans as a rich data source for cross-embodiment learning.

Abstract

Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive teleoperated data collection, which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH^2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Consumer-grade Devices for Data Collection. To keep our method scalable and avoid relying on specialized hardware, we design our data collection process around consumer-grade VR devices.
  • Figure 2: Overview of HAT. Human Action Transformer (HAT) learns a robot policy by modeling humans. During training, we sample a state-action pair from either human data or robot data. The images are encoded by a frozen DinoV2 encoder (Oquab et al., 2023). The HAT model makes predictions in a human-centric action space using wrist 6-DoF poses and fingertips, which are retargeted to robot poses during real-robot deployment.
  • Figure 3: Hardware Illustration. Most robot data comes from Humanoid A, a Unitree H1 robot. Humanoid B, a Unitree H1-2 robot with different arm motor configurations, is used to evaluate few-shot cross-humanoid transfer. Detailed comparisons are given in the paper's section on humanoid differences.
  • Figure 4: Few-Shot Adaptation. Co-training consistently outperforms isolated training as Humanoid B demonstrations increase, achieving robust success rates even in low-data regimes.
  • Figure 5: Human data has better sampling efficiency. Per-grid vertical grasping successes out of 10 trials for models trained on robot-only data and on mixed data. Red boxes indicate where training data is collected.
  • ...and 2 more figures
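
The training-time sampling described in the Figure 2 caption (drawing each state-action pair from either human or robot data, with targets expressed in one shared human-centric action space) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The fingertip count per hand, the feature size, the mixing ratio, and every name used here (`Sample`, `make_dummy`, `sample_batch`, `human_ratio`) are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List

# Assumed layout of the shared human-centric action vector (not from the paper):
# per hand, a 6-DoF wrist pose followed by 3D positions of 5 fingertips.
WRIST_DOF = 6
N_FINGERTIPS = 5                           # assumption
HAND_DIM = WRIST_DOF + 3 * N_FINGERTIPS    # 21 values per hand
ACTION_DIM = 2 * HAND_DIM                  # 42 values for both hands


@dataclass
class Sample:
    embodiment: str             # "human" or "robot"
    image_feature: List[float]  # stand-in for a frozen-encoder image feature
    action: List[float]         # target in the shared action space


def make_dummy(embodiment: str, n: int, rng: random.Random) -> List[Sample]:
    """Build n placeholder samples; real data would come from demonstrations."""
    return [
        Sample(
            embodiment=embodiment,
            image_feature=[rng.random() for _ in range(8)],
            action=[rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)],
        )
        for _ in range(n)
    ]


def sample_batch(
    human: List[Sample],
    robot: List[Sample],
    batch_size: int,
    human_ratio: float,
    rng: random.Random,
) -> List[Sample]:
    """Draw each element from human data with probability human_ratio, else from robot data."""
    batch = []
    for _ in range(batch_size):
        source = human if rng.random() < human_ratio else robot
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    human = make_dummy("human", 100, rng)  # larger human set (sizes are hypothetical)
    robot = make_dummy("robot", 20, rng)   # smaller robot set
    batch = sample_batch(human, robot, batch_size=16, human_ratio=0.5, rng=rng)
    n_human = sum(s.embodiment == "human" for s in batch)
    print(n_human, "human samples out of", len(batch))
```

Because both sources share the same action layout, a single policy head can be trained on the mixed batch. Retargeting to a specific robot would be a separate step at deployment and is not shown here.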