MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training

Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang, Zhenbo Song, Xingang Wang

TL;DR

MimicDreamer tackles the data bottleneck in Vision Language Action (VLA) training by converting abundant egocentric human demonstrations into robot-usable supervision. It stabilizes viewpoints (EgoStabilizer), translates human motions into robot actions (IK-based mapping), and bridges visual gaps (H2R Aligner) to produce paired robot-domain training data. A VLA policy trained on this synthesized data achieves few-shot real-robot execution and improves as more human data is added, with an average 14.7% success-rate gain across six tasks. This work enables scalable, cost-effective VLA training and demonstrates strong potential for cross-domain generalization with reduced dependence on robot-time data collection.

Abstract

Vision Language Action (VLA) models derive their generalization capability from diverse training data, yet collecting embodied robot interaction data remains prohibitively expensive. In contrast, human demonstration videos are far more scalable and cost-efficient to collect, and recent studies confirm their effectiveness in training VLA models. However, a significant domain gap persists between human videos and robot-executed videos, including unstable camera viewpoints, visual discrepancies between human hands and robotic arms, and differences in motion dynamics. To bridge this gap, we propose MimicDreamer, a framework that turns fast, low-cost human demonstrations into robot-usable supervision by jointly aligning vision, viewpoint, and actions to directly support policy training. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos by transferring motion from human manipulation footage. For viewpoint stabilization, we propose EgoStabilizer, which canonicalizes egocentric videos via homography and inpaints occlusions and distortions caused by warping. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver to produce feasible, low-jitter joint commands with accurate pose tracking. Empirically, VLA models trained purely on our synthesized human-to-robot videos achieve few-shot execution on real robots. Moreover, scaling training with human data significantly boosts performance compared to models trained solely on real robot data; our approach improves the average success rate by 14.7% across six representative manipulation tasks.
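
To make the action-alignment step concrete, the following is a minimal sketch of a constrained inverse kinematics loop, not the solver used in the paper. It assumes a hypothetical planar two-link arm and uses damped least squares with joint-limit clamping and a per-iteration step cap; the link lengths, limits, damping value, and all function names are illustrative assumptions. Warm-starting each solve from the previous solution is one simple way to keep consecutive joint commands close together.

```python
import math

# Hypothetical planar 2-link arm; every constant here is an illustrative assumption.
LINK_1, LINK_2 = 0.30, 0.25                         # link lengths in meters
JOINT_LIMITS = [(-math.pi, math.pi), (0.0, 2.6)]    # radians
MAX_STEP = 0.2                                      # cap on joint change per iteration (rad)


def forward(q):
    """End-effector position (x, y) for joint angles q = [q1, q2]."""
    x = LINK_1 * math.cos(q[0]) + LINK_2 * math.cos(q[0] + q[1])
    y = LINK_1 * math.sin(q[0]) + LINK_2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    """2x2 Jacobian of forward() with respect to q."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12],
            [LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12]]


def ik_step(q, target, damping=0.05):
    """One damped least-squares update dq = J^T (J J^T + damping^2 I)^-1 e, then clamp."""
    x, y = forward(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # A = J J^T + damping^2 I is symmetric positive definite, so det > 0.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    wx = (d * ex - b * ey) / det
    wy = (-b * ex + a * ey) / det
    dq = [J[0][0] * wx + J[1][0] * wy, J[0][1] * wx + J[1][1] * wy]
    new_q = []
    for qi, dqi, (lo, hi) in zip(q, dq, JOINT_LIMITS):
        dqi = max(-MAX_STEP, min(MAX_STEP, dqi))     # limit per-iteration motion
        new_q.append(max(lo, min(hi, qi + dqi)))     # enforce joint limits
    return new_q


def solve_ik(q0, target, iters=200, tol=1e-4):
    """Iterate ik_step until the position error drops below tol or iters runs out."""
    q = list(q0)
    for _ in range(iters):
        x, y = forward(q)
        if math.hypot(target[0] - x, target[1] - y) < tol:
            break
        q = ik_step(q, target)
    return q


def track(trajectory, q0):
    """Solve IK along a trajectory, warm-starting from the previous solution."""
    q, commands = list(q0), []
    for point in trajectory:
        q = solve_ik(q, point)
        commands.append(q)
    return commands
```

Calling `track([(0.35, 0.20), (0.36, 0.21)], [0.3, 1.0])` returns one joint vector per waypoint. A real pipeline would work with full 6-DoF poses and the robot's actual kinematic model rather than this toy arm.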

Paper Structure

This paper contains 54 sections, 24 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of MimicDreamer. Viewpoint branch (top left): egocentric videos are stabilized by EgoStabilizer (warp perspective + background inpainting) to produce stable egocentric videos. Camera intrinsics/extrinsics and the robot URDF drive sim rendering to generate additional stable ego views. Action branch (bottom left): 3D hand trajectories are converted to robot actions with an IK solver. Visual alignment (right): H2R Aligner learns to bridge the human-to-robot visual gap using stable egocentric videos and simulation robot videos. The resulting synthesized robot videos and robot actions are used for VLA training.
  • Figure 2: H2R Aligner. During training, the real robot video $V_{\mathrm{gt}}$, background $V_{\mathrm{scene}}$, and simulated foreground $V_{\mathrm{sim}}$ are encoded by a frozen VAE and channel-concatenated as $[\tilde{z}_{\mathrm{tar}},\, z_{\mathrm{scene}},\, z_{\mathrm{sim}}]$ before entering the trainable H2R DiT, which is optimized with CogVideoXLoss. During inference, a hand-masked human background and IK-replayed simulation serve as conditions; the target starts from noise, is denoised by H2R DiT, and decoded by the frozen VAE into synthesized robot videos.
  • Figure 3: Scaling Experiment Results. As more human-to-robot data is added, MimicDreamer's success rate increases monotonically across all six tasks.
  • Figure 4: Visual Results of H2R Aligner. Top: original human demonstration video. Middle: replayed robot simulation from the same action trajectories. Bottom: synthesized robot-domain video generated by H2R Aligner. The generated sequences transfer human motions into robot-arm appearances while preserving background context and manipulation semantics.
  • Figure 5: Qualitative evaluation of EgoStabilizer. On a 300-frame Clean Surface video, frames at indices 0, 150, and 300 are shown before and after stabilization. Keypoints such as wall corners and table–image intersections exhibit large jitter in the original video, whereas the stabilized outputs show negligible displacement, confirming effective viewpoint stabilization.
  • ...and 3 more figures
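
The keypoint-displacement comparison described for Figure 5 can be illustrated with a small sketch of how a homography warp moves keypoints back to a reference view. This is an assumption-laden toy, not EgoStabilizer itself: the shake matrix, the keypoint coordinates, and all function names are hypothetical, and the homography is assumed known rather than estimated from the video.

```python
import math


def apply_homography(H, pt):
    """Map a 2D point through a 3x3 homography, with perspective division."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def invert_3x3(M):
    """Inverse of a 3x3 matrix via the adjugate; assumes M is non-singular."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def mean_displacement(ref_pts, pts):
    """Average Euclidean distance between corresponding keypoints, in pixels."""
    return sum(math.dist(r, p) for r, p in zip(ref_pts, pts)) / len(ref_pts)


# Hypothetical reference keypoints (e.g. wall corners) in the canonical view.
reference = [(100.0, 50.0), (400.0, 60.0), (250.0, 300.0)]

# Hypothetical camera shake: maps reference-view pixels to the jittered frame.
shake = [[1.0, 0.02, 5.0],
         [-0.01, 1.0, -3.0],
         [1e-5, 0.0, 1.0]]

jittered = [apply_homography(shake, p) for p in reference]

# The stabilizing warp is the inverse homography, applied to the jittered keypoints.
stabilize = invert_3x3(shake)
stabilized = [apply_homography(stabilize, p) for p in jittered]

before = mean_displacement(reference, jittered)
after = mean_displacement(reference, stabilized)
```

Here `before` is several pixels while `after` is zero up to floating-point error, because the shake homography is known exactly. With real footage the homography must be estimated per frame, and warping a full image leaves empty border regions, which is where the inpainting step mentioned in the abstract comes in.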