HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos

Haoyang Weng, Yitang Li, Nikhil Sobanbabu, Zihan Wang, Zhengyi Luo, Tairan He, Deva Ramanan, Guanya Shi

TL;DR

HDMI learns robust whole-body humanoid-object interaction (HOI) skills from monocular RGB videos by converting the videos into structured reference trajectories and training a robot-object co-tracking policy. It introduces a unified object representation, a residual action space, and a general interaction reward to support generalization across diverse objects and stable contact during interaction. The approach transfers from simulation to a real Unitree G1, achieving 67 consecutive door traversals and 6 distinct loco-manipulation tasks in the real world, plus 14 tasks in simulation. Overall, HDMI provides a simple, general framework for acquiring interactive humanoid skills directly from human videos.

Abstract

Enabling robust whole-body humanoid-object interaction (HOI) remains challenging due to the scarcity of motion data and the contact-rich nature of these tasks. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets, (ii) trains a reinforcement learning (RL) policy to co-track robot and object states with three key designs: a unified object representation, a residual action space, and a general interaction reward, and (iii) deploys the RL policies zero-shot on real humanoid robots. Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos.
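
The abstract names a residual action space but does not define it. One common reading, assumed here and not confirmed by the text above, is that the policy outputs a correction that is added to the reference joint positions taken from the retargeted trajectory. The sketch below illustrates only that reading; the function name, the `action_scale` parameter, and its default value are hypothetical and are not taken from the paper.

```python
def residual_joint_targets(reference_joint_pos, policy_residual, action_scale=0.25):
    """Combine reference joint positions with a policy residual.

    Assumption (not from the paper): the joint target is
    reference + action_scale * residual, applied per joint.
    """
    if len(reference_joint_pos) != len(policy_residual):
        raise ValueError("reference and residual must have the same length")
    return [
        ref + action_scale * res
        for ref, res in zip(reference_joint_pos, policy_residual)
    ]


# Hypothetical two-joint example: a zero residual reproduces the reference.
reference = [0.1, -0.2]
unchanged = residual_joint_targets(reference, [0.0, 0.0])
nudged = residual_joint_targets(reference, [1.0, -1.0], action_scale=0.5)
```

Under this assumption, a zero policy output simply tracks the reference motion, so the policy only has to learn the corrections needed for contact and balance.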

Paper Structure

This paper contains 20 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: HDMI enables humanoid robots to acquire diverse whole-body interaction skills directly from human videos. (a) Traversing doors: the robot successfully passes through a door for 67 consecutive trials ($\sim 34$ mins), and remains robust under terrain changes. (b) Moving a cardboard box: the robot kneels to grasp and relocate the box, demonstrating coordinated whole-body motion. (c) Carrying and dropping objects: the robot walks forward to pick up and drop a pile of foam mats. (d) A wide range of interaction tasks in simulation, including toppling a wood board, opening a foldable chair, rolling a ball, carrying a box, and pushing a box. Website: https://hdmi-humanoid.github.io
  • Figure 2: HDMI is a general framework for interactive skill learning. Monocular RGB videos are processed into a structured dataset of reference trajectories (\ref{sec:method-retarget}), which are used to train an interaction-centric policy via robot-object co-tracking (\ref{sec:method-tracking}). The trained policies are successfully deployed on real-world humanoids (\ref{sec:real_world}).
  • Figure 3: Reference contact position (yellow dot) in three different tasks. The policy observes the positions of these contact points in the robot's root frame, both during training and deployment.
  • Figure 4: Demonstrations on challenging real-world tasks. (a) Door opening and traversal: the robot adapts its footsteps to different initial poses and terrain variations (with/without wooden board), successfully completing 67 consecutive trips. (b) Box loco-manipulation: the policy enables versatile whole-body coordination for grasping, lifting, and transporting objects of varied shapes, sizes, and weights.
  • Figure 5: Truman's Bow. This demonstration highlights a long, continuous sequence of diverse, contact-rich behaviors.
  • ...and 4 more figures
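
The Figure 3 caption states that the policy observes contact-point positions in the root frame. As a minimal sketch of what such an observation could look like, the code below expresses a world-frame contact point in the robot's root frame. The function names, the `(w, x, y, z)` quaternion convention, and the example numbers are assumptions made for illustration and do not come from the paper.

```python
import math


def quat_conjugate(q):
    """Conjugate of a unit quaternion (w, x, y, z), i.e. the inverse rotation."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + q_vec x t
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def contact_in_root_frame(contact_world, root_pos_world, root_quat_world):
    """Express a world-frame contact point in the robot's root frame."""
    offset = tuple(c - r for c, r in zip(contact_world, root_pos_world))
    return quat_rotate(quat_conjugate(root_quat_world), offset)


# Hypothetical example: the root is at (1, 0, 0) and yawed 90 degrees about z,
# so its forward (x) axis points along world +y.
half = math.radians(90.0) / 2.0
root_quat = (math.cos(half), 0.0, 0.0, math.sin(half))
local = contact_in_root_frame((1.0, 2.0, 0.5), (1.0, 0.0, 0.0), root_quat)
```

In this example the contact point lies 2 m ahead of the root and 0.5 m above it, so `local` is approximately `(2.0, 0.0, 0.5)`. An observation in this frame does not change when the robot and object are moved or rotated together in the world.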