HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos
Haoyang Weng, Yitang Li, Nikhil Sobanbabu, Zihan Wang, Zhengyi Luo, Tairan He, Deva Ramanan, Guanya Shi
TL;DR
HDMI learns robust, interactive whole-body humanoid-object interaction (HOI) skills from monocular RGB videos by converting the videos into structured reference trajectories and training a robot-object co-tracking policy. It introduces a unified object representation, a residual action space, and a general interaction reward to enable generalization across diverse objects and stable contact during interaction. The policies transfer zero-shot from simulation to a Unitree G1 humanoid, achieving 67 consecutive door traversals and 6 distinct loco-manipulation tasks in the real world, plus 14 tasks in simulation. Overall, HDMI provides a simple, general framework for acquiring interactive humanoid skills directly from human videos, advancing the feasibility of autonomous HOI in real-world settings.
Abstract
Enabling robust whole-body humanoid-object interaction (HOI) remains challenging because motion data are scarce and the tasks are contact-rich. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets; (ii) trains a reinforcement learning (RL) policy to co-track robot and object states, using three key designs: a unified object representation, a residual action space, and a general interaction reward; and (iii) deploys the RL policies zero-shot on real humanoid robots. Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos.
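
To make the "residual action space" named in the abstract concrete, the following is a minimal sketch of one common formulation, not the authors' implementation. It assumes the policy outputs a per-joint offset that is scaled and added to the reference joint positions from the retargeted trajectory, and that the sum is sent to a joint-level position controller. The function name `residual_joint_targets`, the parameter `action_scale`, and its default value are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def residual_joint_targets(
    reference_q: Sequence[float],
    policy_action: Sequence[float],
    action_scale: float = 0.25,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Return joint position targets as reference pose plus a scaled residual.

    Assumption: the policy action is an offset around the reference joint
    positions rather than an absolute joint target.
    """
    if len(reference_q) != len(policy_action):
        raise ValueError("reference_q and policy_action must have equal length")
    # A zero action reproduces the reference motion exactly; nonzero actions
    # let the policy correct the reference, for example to maintain contact.
    return [q + action_scale * a for q, a in zip(reference_q, policy_action)]


# Usage: a two-joint example with a zero residual and a nonzero residual.
reference = [0.0, 1.0]
unchanged = residual_joint_targets(reference, [0.0, 0.0])
corrected = residual_joint_targets(reference, [1.0, -2.0], action_scale=0.5)
```

Under this assumption, an untrained policy that outputs values near zero already follows the reference motion, which is one plausible reason a residual parameterization can ease exploration in tracking tasks.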
