DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation

Yuhui Fu, Feiyang Xie, Chaoyi Xu, Jing Xiong, Haoqi Yuan, Zongqing Lu

TL;DR

DemoHLM addresses generalizable humanoid loco-manipulation by combining a simulation-based data generation pipeline with a two-level control hierarchy. From a single simulated demonstration, it synthesizes hundreds to thousands of trajectories spanning the locomotion, pre-manipulation, and manipulation stages, then trains a high-level imitation-learning policy that commands a low-level whole-body controller. The policies transfer from simulation to a real Unitree G1 across ten tasks, performance and generalization improve as the amount of generated data grows, and the pipeline is compatible with multiple behavior cloning (BC) architectures. The result is a scalable, data-efficient way to learn complex loco-manipulation tasks for real-world environments.

Abstract

Loco-manipulation is a fundamental challenge for humanoid robots to achieve versatile interactions in human environments. Although recent studies have made significant progress in humanoid whole-body control, loco-manipulation remains underexplored and often relies on hard-coded task definitions or costly real-world data collection, which limits autonomy and generalization. We present DemoHLM, a framework for humanoid loco-manipulation that enables generalizable loco-manipulation on a real humanoid robot from a single demonstration in simulation. DemoHLM adopts a hierarchy that integrates a low-level universal whole-body controller with high-level manipulation policies for multiple tasks. The whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. The manipulation policies, learned in simulation via our data generation and imitation learning pipeline, command the whole-body controller with closed-loop visual feedback to execute challenging loco-manipulation tasks. Experiments show a positive correlation between the amount of synthetic data and policy performance, underscoring the effectiveness of our data generation pipeline and the data efficiency of our approach. Real-world experiments on a Unitree G1 robot equipped with an RGB-D camera validate the sim-to-real transferability of DemoHLM, demonstrating robust performance under spatial variations across ten loco-manipulation tasks.
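
The hierarchy described in the abstract can be illustrated with a minimal sketch: a high-level policy reads observations and emits commands, and a low-level controller turns each command into joint torques. Everything named here (`run_hierarchy`, `read_obs`, `apply_torques`, the `decimation` ratio between the two loops) is a hypothetical placeholder for illustration. The paper's actual interfaces, command space, and control rates are not stated in this summary.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def run_hierarchy(
    policy: Callable[[Vector], Vector],              # high-level: observation -> command
    controller: Callable[[Vector, Vector], Vector],  # low-level: (command, observation) -> torques
    read_obs: Callable[[], Vector],
    apply_torques: Callable[[Vector], None],
    n_steps: int,
    decimation: int = 5,  # assumed ratio of low-level steps per policy query
) -> int:
    """Run a two-level control loop and return how often the policy was queried."""
    command: Vector = []
    n_policy_calls = 0
    for step in range(n_steps):
        obs = read_obs()
        # The high-level policy is queried only every `decimation` steps;
        # in between, the controller keeps tracking the latest command.
        if step % decimation == 0:
            command = policy(obs)
            n_policy_calls += 1
        apply_torques(controller(command, obs))
    return n_policy_calls


if __name__ == "__main__":
    applied: List[Vector] = []
    calls = run_hierarchy(
        policy=lambda obs: [0.1, 0.0, 0.0],            # stub command
        controller=lambda cmd, obs: [c * 2.0 for c in cmd],  # stub torque mapping
        read_obs=lambda: [0.0],
        apply_torques=applied.append,
        n_steps=10,
    )
    print(calls, len(applied))
```

Running the two levels at different rates is a common design choice in hierarchical robot control, since the torque loop typically needs to run faster than a learned visual policy can be evaluated.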

Paper Structure

This paper contains 27 sections, 4 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of DemoHLM. For each task, we collect a single demonstration via VR teleoperation in simulation and record the robot trajectory in the object frame. This trajectory is then used to generate the pre-manipulation and manipulation phases in our data generation pipeline. The generated transitions include robot proprioception, object poses in the camera frame, and actions expressed as high-level commands sent to the whole-body controller. A manipulation policy is trained using imitation learning on this dataset and is successfully deployed on a real robot to perform loco-manipulation.
  • Figure 2: Hardware Design. We use a Unitree G1 humanoid robot in real-world experiments. To enable active vision, we mount a 2-DoF neck with an Intel RealSense D435 RGB-D camera. For tasks involving small objects, we attach parallel grippers to both end effectors.
  • Figure 3: Loco-manipulation Tasks. We evaluate DemoHLM on ten tasks in both simulation and the real world. Four tasks can be completed using the rubber hands, while the remaining six tasks require parallel grippers for grasping and manipulation. Each task is initialized with randomized object and robot poses, requiring spatial generalization of the learned policies.
  • Figure 4: Real-world policy rollouts. Each pair of rows shows time-aligned first-person and third-person views. Frames progress from left to right over time.
  • Figure 5: Rollouts on LiftBox in simulation and the real world. Frames are ordered left to right, illustrating key stages of the manipulation sequence executed by the learned policy.
  • ...and 1 more figures
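
The Figure 1 caption notes that the demonstration is recorded in the object frame. One way to see why this helps data generation is the sketch below: a trajectory stored relative to the object can be replayed for any new object pose by a single change of frame. This is an illustrative sketch under simplifying assumptions (planar object poses, end-effector poses as 4x4 homogeneous transforms), not the authors' pipeline. All function names and the sampling ranges are hypothetical.

```python
import numpy as np


def pose_to_matrix(x, y, yaw):
    """Build a 4x4 homogeneous transform from a planar pose (x, y, yaw about z)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[0, 3], T[1, 3] = x, y
    return T


def to_object_frame(T_world_obj, T_world_ee_traj):
    """Express world-frame end-effector poses relative to the object."""
    T_obj_world = np.linalg.inv(T_world_obj)
    return [T_obj_world @ T for T in T_world_ee_traj]


def replay_in_world(T_world_obj_new, T_obj_ee_traj):
    """Map an object-frame trajectory back into the world for a new object pose."""
    return [T_world_obj_new @ T for T in T_obj_ee_traj]


def generate_trajectories(T_obj_ee_traj, n, rng, xy_range=0.5, yaw_range=np.pi / 4):
    """Sample n random object poses (assumed ranges) and replay the demo for each."""
    out = []
    for _ in range(n):
        x, y = rng.uniform(-xy_range, xy_range, size=2)
        yaw = rng.uniform(-yaw_range, yaw_range)
        T_new = pose_to_matrix(x, y, yaw)
        out.append((T_new, replay_in_world(T_new, T_obj_ee_traj)))
    return out


if __name__ == "__main__":
    # One demonstrated end-effector pose, 0.2 m behind an object at (1, 0).
    demo_obj = pose_to_matrix(1.0, 0.0, 0.0)
    demo_traj = [pose_to_matrix(0.8, 0.0, 0.0)]
    rel_traj = to_object_frame(demo_obj, demo_traj)
    samples = generate_trajectories(rel_traj, n=3, rng=np.random.default_rng(0))
    print(len(samples))
```

Because the relative pose between end effector and object is preserved for every sampled object pose, one demonstration can seed many spatially varied trajectories. A real pipeline would also need to produce feasible robot motion for each, which this sketch does not attempt.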