Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations
Ruiqian Nai, Boyuan Zheng, Junming Zhao, Haodong Zhu, Sicong Dai, Zunhao Chen, Yihang Hu, Yingdong Hu, Tong Zhang, Chuan Wen, Yang Gao
TL;DR
Humanoid whole-body manipulation remains data-inefficient under teleoperation or sim-to-real RL. HuMI introduces robot-free demonstrations via portable hardware and a hierarchical learning pipeline, combining a diffusion-based high-level policy with a manipulation-centric low-level controller and IK-aware data collection. The approach is presented as the first robot-free humanoid whole-body demonstration system; it achieves up to 3× higher data-collection throughput than teleoperation and a 70% success rate in unseen environments across five tasks, with robust generalization to unseen objects. By integrating IK previews, adaptive end-effector tracking, and a carefully designed policy interface, HuMI enables broad, coordinated, and high-precision whole-body skills that generalize beyond controlled lab settings.
Abstract
Current approaches for humanoid whole-body manipulation, primarily relying on teleoperation or visual sim-to-real reinforcement learning, are hindered by hardware logistics and complex reward engineering. Consequently, demonstrated autonomous skills remain limited and are typically restricted to controlled environments. In this paper, we present the Humanoid Manipulation Interface (HuMI), a portable and efficient framework for learning diverse whole-body manipulation tasks across various environments. HuMI enables robot-free data collection by capturing rich whole-body motion using portable hardware. This data drives a hierarchical learning pipeline that translates human motions into dexterous and feasible humanoid skills. Extensive experiments across five whole-body tasks (kneeling, squatting, tossing, walking, and bimanual manipulation) demonstrate that HuMI achieves a 3× increase in data collection efficiency compared to teleoperation and attains a 70% success rate in unseen environments.
