Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations

Ruiqian Nai, Boyuan Zheng, Junming Zhao, Haodong Zhu, Sicong Dai, Zunhao Chen, Yihang Hu, Yingdong Hu, Tong Zhang, Chuan Wen, Yang Gao

TL;DR

Collecting data for humanoid whole-body manipulation remains inefficient with teleoperation, and visual sim-to-real RL requires complex reward engineering. HuMI collects robot-free demonstrations with portable hardware and feeds them into a hierarchical learning pipeline that combines a diffusion-based high-level policy with a manipulation-centric low-level controller and IK-aware data collection. The resulting robot-free demonstration system for humanoid whole-body manipulation achieves up to $3\times$ higher data-collection throughput than teleoperation and a 70% success rate in unseen environments across five tasks, and is reported to generalize to unseen objects. By integrating IK previews, adaptive end-effector tracking, and a carefully designed policy interface, HuMI supports coordinated, high-precision whole-body skills that carry over beyond controlled lab settings.

Abstract

Current approaches for humanoid whole-body manipulation, primarily relying on teleoperation or visual sim-to-real reinforcement learning, are hindered by hardware logistics and complex reward engineering. Consequently, demonstrated autonomous skills remain limited and are typically restricted to controlled environments. In this paper, we present the Humanoid Manipulation Interface (HuMI), a portable and efficient framework for learning diverse whole-body manipulation tasks across various environments. HuMI enables robot-free data collection by capturing rich whole-body motion using portable hardware. This data drives a hierarchical learning pipeline that translates human motions into dexterous and feasible humanoid skills. Extensive experiments across five whole-body tasks--including kneeling, squatting, tossing, walking, and bimanual manipulation--demonstrate that HuMI achieves a 3x increase in data collection efficiency compared to teleoperation and attains a 70% success rate in unseen environments.

Paper Structure (34 sections, 6 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Humanoid Manipulation Interface (HuMI). Left: Our portable, robot-free data collection facilitates skill transfer from human to humanoid across diverse, unstructured environments. Right: The framework enables a wide repertoire of complex whole-body behaviors.
  • Figure 2: Overview of the HuMI data collection system. (a) Challenges: Relying solely on gripper poses under-specifies whole-body motion, leading to unnatural postures (top); meanwhile, naively scaling human motions to match the robot's size compromises the spatial alignment required for object interaction (bottom). (b) Hardware Setup: Our portable system utilizes handheld sensorized grippers and trackers on the grippers, waist, and feet. A real-time IK preview interface enables human-in-the-loop kinematic adaptation. (c) Data Processing: Collected data serves two purposes: visual observations and task-space SE(3) trajectories train the high-level policy, while whole-body IK solutions provide reference motions for the low-level controller.
  • Figure 3: Hierarchical control framework of HuMI. (1) A high-level Diffusion Policy (5 Hz) processes camera images and proprioception to generate receding-horizon task-space trajectories (action chunks). (2) A low-level Whole-Body Controller (50 Hz) tracks these keypoint targets $p_t$, integrating the current robot state $s_t$ (IMU, joint positions/velocities) to compute precise joint actuation commands $a_t$.
  • Figure 4: Impact of reference frame selection on action chunk continuity. Due to tracking error, the executed robot pose (dark gray) "lags" behind the scheduled target (light gray). Naively anchoring the next action chunk to the current executed pose results in a sudden trajectory reversal (red line), disrupting momentum. By instead using the previous scheduled target as the reference frame, the policy produces a smooth, continuous trajectory (green line) that maintains the intended motion profile.
  • Figure 5: Mitigating drift in non-vision-grounded keypoints. Left: Trajectories during a doll-grasping task. The "sighted" gripper (green) remains anchored via visual feedback, whereas the "blind" pelvis (red) suffers from open-loop drift ($>5$ cm) over time. Right: Decomposition of the action chunk at time $t$. Because the absolute height (left axis) is corrupted by cumulative error, we discard absolute tracking in favor of relative transforms within the chunk (right axis).
  • ...and 8 more figures
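
The ideas in the captions of Figures 4 and 5 can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes a one-dimensional position instead of full SE(3) poses, a policy that outputs offsets relative to a reference position, and chunk-relative heights taken with respect to the first element of the chunk. The function names `anchor_chunk` and `to_chunk_relative` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def anchor_chunk(reference, offsets):
    """Convert chunk offsets (relative to `reference`) into absolute targets."""
    return reference + np.asarray(offsets, dtype=float)


def to_chunk_relative(chunk):
    """Express every element of a chunk relative to the chunk's first element."""
    chunk = np.asarray(chunk, dtype=float)
    return chunk - chunk[0]


# --- Figure 4: choice of reference frame (hypothetical 1-D numbers) ---
last_scheduled_target = 0.50  # where the previous chunk scheduled the robot to be
executed_position = 0.42      # where the robot actually is (lags due to tracking error)
offsets = np.array([0.02, 0.04, 0.06])  # assumed policy output: forward motion

# Anchoring to the executed pose puts the first new target behind the
# previous scheduled target, i.e. the commanded trajectory reverses.
naive = anchor_chunk(executed_position, offsets)        # ≈ [0.44, 0.46, 0.48]

# Anchoring to the previous scheduled target continues the motion forward.
continuous = anchor_chunk(last_scheduled_target, offsets)  # ≈ [0.52, 0.54, 0.56]

# --- Figure 5: a constant drift cancels in chunk-relative coordinates ---
true_heights = np.array([0.70, 0.68, 0.66])  # assumed pelvis heights in metres
drifted = true_heights + 0.05                # same chunk with 5 cm accumulated drift

print("naive first target:", naive[0])
print("continuous first target:", continuous[0])
print("relative heights agree:",
      np.allclose(to_chunk_relative(drifted), to_chunk_relative(true_heights)))
```

In this toy setting, the naive anchor commands a first target behind the previously scheduled one, which corresponds to the reversal in Figure 4, while the chunk-relative heights are identical with and without the constant offset, which is why relative transforms are insensitive to accumulated drift of that form.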