X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
TL;DR
The paper tackles data scarcity in robotics by introducing X-Humanoid, a video-to-video editing framework that converts human videos into humanoid videos using a finetuned Wan 2.2 diffusion model. Because paired human-humanoid training data does not exist at scale, the authors generate a synthetic paired dataset in Unreal Engine, then apply the trained model to 60 hours of real-world Ego-Exo4D video to produce a robotized dataset of over 3.6 million frames. With LoRA-based finetuning and a flow-matching objective, the approach preserves the human's motion while faithfully depicting a fixed humanoid embodiment, and it outperforms baselines in motion consistency, embodiment accuracy, and overall video quality. The work provides data for training robotics VLA models and world models, and it demonstrates that robotized video generation can scale to in-the-wild Internet videos.
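The flow-matching objective mentioned above can be illustrated with a minimal sketch. It assumes the common linear-path (rectified-flow) formulation, which this summary does not confirm as the paper's exact variant. All names here (`interpolate`, `target_velocity`, `flow_matching_loss`, the toy latents, and the placeholder predictor) are hypothetical; the real model would operate on video latents with a LoRA-adapted network conditioned on the human video.

```python
import random

def interpolate(noise, clean, t):
    """Point on the straight line from a noise sample (t=0) to a clean sample (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]

def target_velocity(noise, clean):
    """Velocity of the linear path, d x_t / dt = clean - noise (constant in t)."""
    return [c - n for n, c in zip(noise, clean)]

def flow_matching_loss(predict_velocity, clean, condition, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t, condition) stands in for the finetuned network
    (an assumption; the actual architecture is not described in this section).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]  # Gaussian starting point
    t = rng.random()                              # random time in [0, 1)
    x_t = interpolate(noise, clean, t)
    v_target = target_velocity(noise, clean)
    v_pred = predict_velocity(x_t, t, condition)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(clean)

# Toy stand-ins: a "humanoid" target latent and the "human" conditioning latent.
humanoid_latent = [0.5, -1.0, 2.0]
human_latent = [0.4, -0.9, 1.8]

def zero_predictor(x_t, t, condition):
    """Placeholder model that always predicts zero velocity."""
    return [0.0] * len(x_t)

loss = flow_matching_loss(zero_predictor, humanoid_latent, human_latent, random.Random(0))
print(f"loss with an untrained placeholder predictor: {loss:.4f}")
```

In this formulation the network only has to regress a velocity at a randomly sampled time, so training reduces to a plain mean-squared-error loss over paired human and humanoid clips.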
Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has proven effective for policy training. Existing solutions, however, mainly "overlay" robot arms onto egocentric videos and cannot handle the complex full-body motions and scene occlusions found in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% rated it best for embodiment correctness.
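As a quick sanity check on the dataset figures, the two numbers quoted in the abstract imply an average frame rate. The sketch below assumes that the 60 hours and the 3.6 million frames describe the same footage; since the abstract says "over 3.6 million", the result is a lower bound rather than a reported frame rate. The helper name `implied_fps` is hypothetical.

```python
def implied_fps(total_frames, hours):
    """Average frames per second implied by a frame count and a duration in hours."""
    seconds = hours * 3600
    return total_frames / seconds

# 3.6 million frames over 60 hours (216,000 seconds)
print(round(implied_fps(3_600_000, 60), 2))  # → 16.67
```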
