Masquerade: Learning from In-the-wild Human Videos using Data-Editing

Marion Lepert, Jiaying Fang, Jeannette Bohg

TL;DR

Masquerade closes the visual embodiment gap between humans and robots by editing in-the-wild egocentric videos into robotized demonstrations, pre-training a vision encoder on 675K frames of the edited clips, and co-training it with a diffusion-based policy head using only 50 robot demonstrations per task. This combination enables zero-shot deployment on long-horizon, bimanual tasks in unseen environments, outperforming baselines by 5-6x, with ablations showing that both robot overlays and co-training are essential. The work demonstrates scalable robot learning from in-the-wild human video data, with clear directions for improving overlays, depth reasoning, and retargeting to dexterous manipulators. Overall, Masquerade provides a practical pathway to leverage abundant human video data for long-horizon robot manipulation in diverse settings.

Abstract

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.
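
The co-training step described in the abstract combines two signals: an imitation loss on real robot demonstrations and the auxiliary 2-D keypoint loss on edited human clips. The following is a minimal sketch of how such a combined objective could be computed, not the authors' implementation. The squared-error form of the keypoint loss, the `aux_weight` parameter, and the function names are all assumptions made for illustration; the imitation loss is passed in as a precomputed scalar because the diffusion policy head is outside the scope of the sketch.

```python
import numpy as np


def keypoint_loss(pred_kp, target_kp):
    """Mean squared error between predicted and target 2-D keypoints.

    Both arrays have shape (batch, num_keypoints, 2), in pixel coordinates.
    The squared-error form is an assumption, not taken from the paper.
    """
    pred_kp = np.asarray(pred_kp, dtype=float)
    target_kp = np.asarray(target_kp, dtype=float)
    return float(np.mean((pred_kp - target_kp) ** 2))


def cotraining_loss(imitation_loss, pred_kp, target_kp, aux_weight=1.0):
    """Imitation loss on a robot batch plus a weighted auxiliary keypoint
    loss on an edited-human-video batch.

    aux_weight is a hypothetical hyperparameter balancing the two terms.
    """
    return imitation_loss + aux_weight * keypoint_loss(pred_kp, target_kp)
```

Under these assumptions, setting `aux_weight` to zero reduces the objective to plain imitation learning on the robot demonstrations, which is one way to picture what the co-training ablation removes.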

Paper Structure

This paper contains 30 sections, 4 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Overview of Masquerade. Left: Large‑scale in‑the‑wild egocentric human videos are edited to obtain “robotized” demonstrations that bridge the visual embodiment gap. A vision representation is pre‑trained to predict future 2D robot poses on 675K frames of these edited clips. Center: the vision representation is co-trained with a diffusion policy head on 50 real robot demonstrations collected in a single scene. Right: The resulting policy is deployed zero‑shot in previously unseen environments, achieving significantly more robust manipulation performance than baselines despite domain shifts.
  • Figure 2: Overview of Masquerade. (1) In-the-wild egocentric human videos are converted into “robotized” clips by extracting 3D hand poses, inpainting out the human arms, and overlaying a rendered bimanual robot in the same pose. (2) A ViT-Base vision encoder is pretrained on these edited videos using a 2D keypoint regression loss. (3) During cotraining, the encoder and a diffusion-based policy head are jointly optimized on a mix of edited human videos (auxiliary 2D loss) and real robot demonstrations (imitation loss).
  • Figure 3: Scenes used for each task in in-distribution (center) versus out-of-distribution (right) settings. The top row shows the Stack Pots scenes, the middle row the Scrape Potato scenes, and the bottom row the Sweep Chilis scenes.
  • Figure 4: Average success rate (%) on three bimanual tasks—Stack Pots, Scrape Potato, Sweep Chilis. Each task is evaluated over three out-of-distribution scenes (10 rollouts per scene, 30 per task). Our method, Masquerade, substantially outperforms all baselines; error bars show ± SEM.
  • Figure 5: Ablation study on the Stack Pots, Scrape Potato, and Sweep Chilis tasks, demonstrating that both robot overlays and co-training are essential for achieving robust success rates in out-of-distribution settings. Results are evaluated in OOD scene 1, with 25 rollouts per bar.
  • ...and 6 more figures
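
The Figure 4 caption reports success rates with ± SEM error bars over 30 rollouts per task. As a rough illustration of how such an error bar can be obtained from binary rollout outcomes, the sketch below computes the success rate and its standard error of the mean. It assumes the SEM is taken across individual rollouts using the sample standard deviation; the paper may instead aggregate across scenes. The function name and the example outcomes are hypothetical placeholders, not results from the paper.

```python
import math


def success_rate_with_sem(outcomes):
    """Return (success rate, SEM), both in percent, for binary outcomes.

    outcomes: sequence of 0/1 values, one per rollout.
    SEM = sample standard deviation / sqrt(n), with Bessel's correction.
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least two rollouts to estimate the SEM")
    rate = sum(outcomes) / n
    variance = sum((x - rate) ** 2 for x in outcomes) / (n - 1)
    sem = math.sqrt(variance / n)
    return 100.0 * rate, 100.0 * sem


# Placeholder outcomes for 30 rollouts (illustrative only, not paper data).
example_outcomes = [1] * 15 + [0] * 15
rate_pct, sem_pct = success_rate_with_sem(example_outcomes)
```

With only 30 binary rollouts per task, the SEM is several percentage points wide near a 50% success rate, which is worth keeping in mind when comparing bars that are close together.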