X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou

TL;DR

The paper tackles data scarcity in robotics by introducing X-Humanoid, a video-to-video editing framework that converts human videos into humanoid videos using a fine-tuned Wan 2.2 diffusion model. It addresses the lack of large-scale robot video data by generating a synthetic paired dataset in Unreal Engine and applying the model to real-world videos to produce a 60-hour, 3.6 million-frame robotized Ego-Exo4D dataset. Through LoRA-based finetuning and a flow-matching objective, the approach preserves human motion while faithfully depicting a fixed humanoid embodiment, outperforming baselines in motion consistency, embodiment accuracy, and overall video quality. The work provides valuable resources for robotics VLA and world-model training and demonstrates the feasibility of scalable, web-scale robotized video generation, including in-the-wild Internet videos.
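To make the flow-matching objective mentioned above concrete, here is a minimal sketch of how a training target is typically formed in one common rectified-flow convention. It is an illustration under assumptions, not the paper's implementation: the linear interpolation, the velocity definition, and the names `flow_matching_pair` and `mse` are hypothetical, and the exact parameterization used for Wan 2.2 is not stated in this section.

```python
def flow_matching_pair(clean, noise, t):
    """Build one flow-matching training example (assumed convention).

    clean: latent values of the target (humanoid) video
    noise: Gaussian noise values of the same length
    t:     time in [0, 1]; t=0 is pure noise, t=1 is the clean latent
    """
    # Noisy input: a straight-line interpolation between noise and data.
    x_t = [t * c + (1.0 - t) * n for c, n in zip(clean, noise)]
    # Regression target: the constant velocity along that straight line.
    velocity = [c - n for c, n in zip(clean, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Toy values standing in for real latents.
x_t, v = flow_matching_pair(clean=[1.0, 2.0], noise=[0.0, 0.0], t=0.5)
```

In training, a network would receive `x_t`, `t`, and the conditioning (here, the human video) and be penalized with `mse` against `v`.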

Abstract

The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has proven effective for policy training. However, existing solutions mainly "overlay" robot arms onto egocentric videos and cannot handle the complex full-body motions and scene occlusions of third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

Paper Structure

This paper contains 31 sections, 3 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Our proposed paradigm for "robotizing" human videos. (1) We adapt a Video Diffusion Transformer (DiT) into a video-to-video architecture. (2) This model is finetuned on a novel synthetic dataset of paired human-humanoid videos we generated using Unreal Engine. (3) At inference, the finetuned model translates real-world human videos (e.g., Ego-Exo4D [egoexo4d]) into a large-scale dataset of humanoid videos, which can be used to train downstream robotics VLA or world models.
  • Figure 2: Our synthetic data creation pipeline. We leverage rich community assets (e.g., Fab marketplace) for characters, animations, and environments (blue) to create paired human-humanoid videos in three steps: First (red), we align the skeletons of different characters and animations to ensure compatibility across all assets. Second (green), with the help of compatible skeletons, we transfer the same animation to both human and humanoid characters. Finally (yellow), we place the animated human and humanoid in diverse scenes and record them using identical camera setups and movements to produce the paired data.
  • Figure 3: Visualization of sample pairs from our synthesized Human-Humanoid video dataset. To ensure diversity, the dataset encompasses a wide variety of scenes, character motions, and camera parameters (e.g., focal length, exposure). We intentionally include challenging conditions, such as occlusions by obstacles and partial-body or off-center framing, to improve model robustness.
  • Figure 4: The network architecture of our method. The original Wan 2.2 model does not accept video input. It only contains generation tokens (red), which the model denoises to generate a video. Parallel to these generation tokens, we encode the input human video into condition tokens (blue) and concatenate all tokens. During self-attention, we apply a mask to prevent condition tokens from attending to generation tokens. At the output, we only retain the generation tokens and decode them to produce the edited video.
  • Figure 5: Qualitative comparison to baselines. The videos are sampled at frames 15, 30, 45, 60, and 75. MoCha [mocha] generally maintains the original character's motion, but it often produces incorrect details, such as the upper arm shape in (a) and the leg pose in (b). Kling [kling] generates the correct embodiment but fails on motion consistency, resulting in unsynchronized actions and undesirable scene alterations (e.g., in (a), dropping the green bag to the ground instead of putting it back). Runway Aleph [aleph] fails to produce either the correct robot embodiment (a) or the correct motion (b). In contrast, our method successfully generates the correct embodiment and motions synchronized with the original human.
  • ...and 7 more figures
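
The masking rule in the Figure 4 caption (condition tokens must not attend to generation tokens, and only generation tokens are kept at the output) can be sketched as a boolean attention mask. This is a minimal illustration, not the authors' code: the token ordering (condition tokens first, then generation tokens) and the helper names `build_attention_mask` and `keep_generation_tokens` are assumptions made for the example.

```python
def build_attention_mask(num_cond, num_gen):
    """Return allowed[q][k], True when query token q may attend to key token k.

    Assumed layout: indices [0, num_cond) are condition tokens (input human
    video), indices [num_cond, num_cond + num_gen) are generation tokens.
    """
    total = num_cond + num_gen
    allowed = [[True] * total for _ in range(total)]
    for q in range(num_cond):                # condition-token queries
        for k in range(num_cond, total):     # generation-token keys
            allowed[q][k] = False            # blocked direction
    return allowed


def keep_generation_tokens(tokens, num_cond):
    """Drop condition tokens at the output; only generation tokens are decoded."""
    return tokens[num_cond:]


mask = build_attention_mask(num_cond=2, num_gen=3)
outputs = keep_generation_tokens(["c0", "c1", "g0", "g1", "g2"], num_cond=2)
```

With this mask, generation tokens still see the whole sequence, while the condition tokens depend only on the input video and are unaffected by the noisy generation tokens.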