EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Yinghui Li, Di Huang, Ping Luo, Hongyang Li, Li Chen

TL;DR

EgoHumanoid tackles the challenge of transferring human loco-manipulation skills to humanoid robots by leveraging abundant egocentric human demonstrations alongside limited robot data. It introduces an embodiment-alignment pipeline with view alignment (depth-based reprojection and inpainting) and action alignment (unified delta end-effector and locomotion spaces) to enable vision-language-action co-training across data sources. The framework is validated on a Unitree G1 across four real-world tasks, showing significant generalization gains, with average improvements of up to 82% in unseen environments and 60%-level gains on challenging sub-skills when incorporating human data. Key contributions include the first demonstration of human-to-humanoid loco-manipulation transfer, a principled cross-embodiment alignment approach, and comprehensive real-world evaluations that reveal scaling laws and transferable behaviors. This work demonstrates the practical potential of scalable egocentric human data to broaden the generalization and deployment of humanoid control systems in diverse, unstructured settings.

Abstract

Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.
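
The idea of a unified delta end-effector action space can be illustrated with a short sketch. It is not the paper's implementation: it assumes that poses are 4x4 homogeneous transforms and that the delta is expressed in the current end-effector frame, and the helper names (`matmul`, `invert_rigid`, `pose_z`, `delta_pose`) are hypothetical. Under those assumptions, a human wrist trajectory and the same motion recorded in a different base frame produce identical delta actions, which is one reason relative poses may suit cross-embodiment co-training.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] using [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def pose_z(yaw, x, y, z):
    """Build a pose with a rotation about z and a translation (illustrative only)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def delta_pose(t_now, t_next):
    """Relative motion from t_now to t_next, expressed in the t_now frame."""
    return matmul(invert_rigid(t_now), t_next)

# Hypothetical human wrist poses at two consecutive time steps.
human_now = pose_z(0.30, 1.00, 2.00, 1.40)
human_next = pose_z(0.40, 1.05, 2.00, 1.40)

# The same motion seen from another base frame (e.g. a robot's), made up here.
base_offset = pose_z(1.00, -3.00, 0.50, -0.40)
robot_now = matmul(base_offset, human_now)
robot_next = matmul(base_offset, human_next)

delta_human = delta_pose(human_now, human_next)
delta_robot = delta_pose(robot_now, robot_next)
```

Because the base-frame offset cancels inside `delta_pose`, `delta_human` and `delta_robot` agree up to floating-point error. Kinematic feasibility on the robot (reach, joint limits) is a separate concern that this sketch does not address.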

Paper Structure (34 sections, 1 equation, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Introducing EgoHumanoid, the first investigation of human-to-humanoid transfer for whole-body loco-manipulation. Robot teleoperation data collection is constrained to laboratory environments due to hardware and safety limitations, while in-the-wild human demonstrations offer scalable diversity in objects, scenes, lighting, and viewpoints. Our alignment pipeline bridges the embodiment gap through view and action alignment, enabling vision-language-action (VLA) co-training on both data sources. Real-world loco-manipulation deployment validates that egocentric human demonstrations improve generalization without scene-specific robot data, outperforming robot-only baselines by 51% with consistent scaling behavior.
  • Figure 2: Hardware setup for data collection. Humans and the G1 humanoid robot are equipped with an integrated VR-based system for portable usage and agile development. The same camera captures egocentric recordings. The VR headset and trackers provide coarse human poses, while the VR controller is employed to teleoperate the robot.
  • Figure 3: Pipeline of human-to-humanoid alignment. (a) View Alignment: Egocentric images are transformed to approximate robot viewpoints by reprojecting estimated depth points and applying generative inpainting to fill the resulting holes. (b) Action Alignment: We employ relative end-effector poses to unify the upper-body action space, and discrete commands for lower-body locomotion.
  • Figure 4: Humanoid loco-manipulation tasks for evaluation. We design four tasks that span varying levels of difficulty in large-space movement and dexterous manipulation. Robots are teleoperated in laboratories as the source domain (Top). Human-centric scenes occur in human demonstrations only (Middle), and are used as testbeds for generalization evaluation (Bottom).
  • Figure 5: Performance of human-robot data co-training with EgoHumanoid. Our pipeline achieves consistent improvements over robot-only baselines across in-domain and generalized environments. The boost is amplified in settings unseen in the robot data.
  • ...and 4 more figures
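
The reprojection step described in the Figure 3 caption can be sketched as a forward warp under a pinhole camera model. This is a minimal illustration under stated assumptions, not the authors' code: the function name `warp_to_target_view`, the intrinsics, and the toy 4x4 image are all hypothetical, the depth map is taken as given rather than estimated, and the generative inpainting step is omitted (the sketch only reports which pixels would need filling).

```python
def warp_to_target_view(image, depth, fx, fy, cx, cy, rot, trans):
    """Forward-warp `image` into a target camera using per-pixel depth.

    `rot` (3x3) and `trans` (3,) map source-camera points to target-camera
    coordinates. Returns the warped image and a boolean hole mask.
    """
    h, w = len(image), len(image[0])
    warped = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            if z <= 0:
                continue  # no valid depth for this pixel
            # Back-project the pixel to a 3D point in the source camera frame.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Express the point in the target camera frame.
            q = [sum(rot[i][k] * p[k] for k in range(3)) + trans[i]
                 for i in range(3)]
            if q[2] <= 0:
                continue  # point lies behind the target camera
            ut = round(fx * q[0] / q[2] + cx)
            vt = round(fy * q[1] / q[2] + cy)
            # Keep the nearest surface when several points hit one pixel.
            if 0 <= ut < w and 0 <= vt < h and q[2] < zbuf[vt][ut]:
                zbuf[vt][ut] = q[2]
                warped[vt][ut] = image[v][u]
    holes = [[px is None for px in row] for row in warped]
    return warped, holes

# Toy example: a flat wall 2 m away, viewed by a camera lowered by 1 m
# (image y points down, so target-frame points shift by -1 along y).
image = [[10 * r + c for c in range(4)] for r in range(4)]
depth = [[2.0] * 4 for _ in range(4)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
warped, holes = warp_to_target_view(
    image, depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5,
    rot=identity, trans=[0.0, -1.0, 0.0])
```

In this toy case the content moves up by one row and the bottom row has no source pixels, so it is flagged in `holes`. Regions like this are what an inpainting model would be asked to fill.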