Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation

Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, Guanya Shi

TL;DR

This work introduces H2O, a scalable RL-based framework for real-time whole-body humanoid teleoperation using only an RGB camera. By combining a retargeting pipeline with a sim-to-data approach and domain randomization, it trains a robust motion imitator that transfers to real hardware in a zero-shot manner. The system demonstrates diverse, dynamic capabilities (e.g., walking, kicking, coordinating full-body motions) and shows strong performance in both simulation and real-world tests. The study also outlines practical limitations and future directions toward universal, real-time, and fully embodied humanoid teleoperation.

Abstract

We present Human to Humanoid (H2O), a reinforcement learning (RL) based framework that enables real-time whole-body teleoperation of a full-sized humanoid robot with only an RGB camera. To create a large-scale retargeted motion dataset of human movements for humanoid robots, we propose a scalable "sim-to-data" process to filter and pick feasible motions using a privileged motion imitator. Afterwards, we train a robust real-time humanoid motion imitator in simulation using these refined motions and transfer it to the real humanoid robot in a zero-shot manner. We successfully achieve teleoperation of dynamic whole-body motions in real-world scenarios, including walking, back jumping, kicking, turning, waving, pushing, boxing, etc. To the best of our knowledge, this is the first demonstration to achieve learning-based real-time whole-body humanoid teleoperation.
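The "sim-to-data" filtering step described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the names `MotionClip`, `filter_feasible`, and `toy_rollout` are hypothetical, and the acceptance criterion (no fall, plus a mean tracking error under a threshold) is only one plausible way to decide whether the privileged imitator can track a clip. In a real pipeline, `rollout` would run the trained privileged policy in a physics simulator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class MotionClip:
    name: str
    frames: Sequence[float]  # placeholder for per-frame retargeted reference poses


# A rollout returns (fell, mean_tracking_error) for one clip.
Rollout = Callable[[MotionClip], Tuple[bool, float]]


def filter_feasible(clips: Sequence[MotionClip],
                    rollout: Rollout,
                    max_error: float) -> List[MotionClip]:
    """Keep only clips the imitator tracks without falling and within max_error."""
    kept = []
    for clip in clips:
        fell, mean_error = rollout(clip)
        if not fell and mean_error <= max_error:
            kept.append(clip)
    return kept


def toy_rollout(clip: MotionClip) -> Tuple[bool, float]:
    # Hypothetical stand-in for simulating the privileged imitator:
    # clips with large peak values are treated as infeasible (the robot "falls").
    peak = max(abs(x) for x in clip.frames)
    return peak > 1.0, 0.1 * peak


clips = [
    MotionClip("walk", [0.1, 0.2, 0.3]),
    MotionClip("backflip", [0.5, 2.0, 0.4]),
    MotionClip("wave", [0.0, 0.6]),
]
kept = filter_feasible(clips, toy_rollout, max_error=0.1)
kept_names = [c.name for c in kept]
print(kept_names)
```

In this toy run the "backflip" clip is dropped because the stub reports a fall, while "walk" and "wave" pass. The surviving clips play the role of the refined motions used to train the real-time imitator.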

Paper Structure (32 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The humanoid robot is teleoperated in real time by a human operator using an RGB camera. (a) The humanoid mimics the human teleoperator, advancing one step while delivering a punch to displace a box, followed by a victory gesture. (b) The humanoid executes a precise sidestep to align with a ball and delivers a controlled kick using its right foot. (c) The humanoid walks forward while pushing a stroller. (d) The operator teleoperates the humanoid to catch a box, rotate its waist, and drop the box into a waste bin. Videos: see https://human2humanoid.com.
  • Figure 2: Fitting the SMPL body to the H1 humanoid. (a) Visualization of the humanoid keypoints (red dots). (b) Humanoid keypoints vs. SMPL keypoints (green dots and mesh) before and after fitting the SMPL shape $\boldsymbol{\beta}'$. (c) The corresponding 12 joint positions before and after fitting.
  • Figure 3: Effect of using a fitted SMPL shape $\boldsymbol{\beta}'$ instead of the mean body shape on position-based retargeting. (a) Retargeting without $\boldsymbol{\beta}'$, which results in unstable "in-toed" humanoid motion. (b) Retargeting with $\boldsymbol{\beta}'$, which results in balanced humanoid motion.
  • Figure 4: Overview of H2O. (a) Retargeting (\ref{SEC:retargeting}): H2O first aligns the SMPL body model to the humanoid's structure by optimizing shape parameters. H2O then retargets the motions and removes infeasible ones using a trained privileged imitation policy, producing a clean motion dataset. (b) Sim-to-Real Training (\ref{SEC:PolicyTraining}): An imitation policy is trained to track motion goals sampled from the cleaned dataset. (c) Real-time Teleoperation Deployment (\ref{sec:real_experiments}): Human motion is captured with an RGB camera and a pose estimator, and the humanoid robot mimics it using the trained sim-to-real imitation policy.
  • Figure 5: The humanoid robot is able to track the precise lower-body movements of the human teleoperator.
  • ...and 2 more figures
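
The shape-fitting idea in Figures 2 and 4(a), optimizing body-shape parameters so that the model's joints line up with the humanoid's keypoints, can be sketched with a toy example. This is a minimal sketch under loud assumptions: it does not use the real SMPL model, and instead treats the 12 joint positions as a hypothetical linear function of two shape parameters (`mean` plus a weighted sum of `basis` vectors). The function names, the basis, the step size, and the plain gradient-descent loop are all illustrative choices, not the paper's procedure.

```python
from typing import List, Sequence

NUM_JOINTS = 12          # matches the 12 corresponding joints in Figure 2(c)
DIM = NUM_JOINTS * 3     # flattened (x, y, z) coordinates


def joints(beta: Sequence[float],
           mean: Sequence[float],
           basis: Sequence[Sequence[float]]) -> List[float]:
    """Toy linear body model: joint coordinates = mean + sum_k beta[k] * basis[k]."""
    return [m + sum(b * basis[k][i] for k, b in enumerate(beta))
            for i, m in enumerate(mean)]


def fit_shape(target: Sequence[float],
              mean: Sequence[float],
              basis: Sequence[Sequence[float]],
              steps: int = 200,
              lr: float = 0.02) -> List[float]:
    """Gradient descent on the squared joint-position error with respect to beta."""
    beta = [0.0] * len(basis)
    for _ in range(steps):
        resid = [p - t for p, t in zip(joints(beta, mean, basis), target)]
        grad = [2.0 * sum(r * v for r, v in zip(resid, bk)) for bk in basis]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta


# Hypothetical model: one parameter shifts all z coordinates, the other all x coordinates.
mean = [0.1 * i for i in range(DIM)]
basis = [
    [1.0 if i % 3 == 2 else 0.0 for i in range(DIM)],
    [1.0 if i % 3 == 0 else 0.0 for i in range(DIM)],
]

# Stand-in for the humanoid keypoints: joints generated from a known shape.
true_beta = [0.3, -0.2]
target = joints(true_beta, mean, basis)

fitted = fit_shape(target, mean, basis)
print([round(b, 4) for b in fitted])
```

Because the toy target is generated from a known shape, the fitted parameters should recover it closely. With the real SMPL model the joint positions depend on shape through a learned regressor, so an automatic-differentiation framework would typically replace the hand-written gradient.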