
EmbodiSwap for Zero-Shot Robot Imitation Learning

Eadom Dessalene, Pavan Mantripragada, Michael Maynord, Yiannis Aloimonos

TL;DR

This work tackles zero-shot robot imitation learning by leveraging abundant in-the-wild human egocentric video. It introduces EmbodiSwap to generate photorealistic robot overlays on human footage and trains a closed-loop policy using a V-JEPA backbone to forecast relative end-effector transforms. The approach achieves an 82% real-world success rate and outperforms few-shot baselines, while also demonstrating the superiority of feature-level video-prediction pretraining for end-effector forecasting. By releasing the synthetic robot-overlay dataset, code, and model checkpoints, the work advances scalable cross-embodiment imitation and reduces the need for robot-specific demonstrations across tasks and environments.

Abstract

We introduce EmbodiSwap, a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild egocentric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing V-JEPA from the domain of video understanding to imitation learning over synthetic robot videos. Adoption of V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an $82\%$ success rate, outperforming both a few-shot trained $\pi_0$ network and $\pi_0$ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays, which takes human videos and an arbitrary robot URDF as input and generates a robot dataset, (ii) the robot dataset we synthesize over EPIC-Kitchens, HOI4D and Ego4D, and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.

Paper Structure

This paper contains 18 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our training setup: Our system takes as input a sequence of frames $\{I_0, ..., I_T\}$ featuring a human actor performing an action. The first frame of this sequence is passed to a multi-step Robot Compositing process, producing an image $I^*_0$ in which the human hand of $I_0$ is replaced with a robot manipulator. The robot image $I^*_0$ is passed into the V-JEPA encoder. The output of the encoder is passed to the V-JEPA predictor along with a stack of positional mask tokens $M_{1:T}$ that correspond to frames $I_{1:T}$. The output of the predictor is then fed into cross-attention layers $C$, along with optional (represented by dashed lines) encoded representations of a proprioception token $p_0$ and an action location token $l_0$ (both associated with $I_0$). $C$ outputs a prediction of the relative hand transform from $I_0$ to $I_T$. Training is supervised with an $L_1$ loss against the relative 3D transform of the hand between images $I_0$ and $I_T$, as derived by the 3D Hand Reconstruction network. For clarity, the model output and the 3D Hand Reconstruction output are overlaid onto $I_T$ in the figure.
  • Figure 2: Overview of our Zero-Shot test setup: After training V-JEPA over distributions of scenes, objects, and embodiments, we deploy the network on out-of-distribution embodiments, objects, and environments. Test sequences, each consisting of the first RGB frame plus positional masks for subsequent frames, are provided as input, producing relative transformation predictions as output. We illustrate three example sequences, one each for close, open, and pour. Input is shown on the left, and the consequence of action execution is shown on the right.
  • Figure 3: Overview of our robot compositing pipeline: The process begins with a human RGB frame in which a hand is visible. This image is first processed by three components: 1) a Body Segmentation Network, which produces a binary segmentation mask of the human actor; 2) a 3D Hand Extractor, which reconstructs the human hand skeleton in 3D; and 3) a Depth Network, which estimates accurate metric depth (a composite of the grayscale image and the depth image is shown for visualization purposes). The outputs of these components are then processed by two further components: 4) Image Inpainting, which takes the original RGB image and the body segmentation and erases the human actor and their effects from the scene; and finally 5) Render and Blend, which takes the inpainted image, the depth map, and the end-effector pose, renders a synthetic robot manipulator, composites it into the scene, and adjusts foreground/background contents based on the depth differences between the scene and the robot.
  • Figure 4: Action and sub-action boundaries: A visualization of 3 example sequences taken from EPIC-Kitchens. The top row within each example corresponds to cropped RGB frames. The purple arrow in the middle row corresponds to the ground truth annotated action and its temporal boundaries as provided by EPIC-Kitchens. The (red, green, blue, pink, and tan) arrows in the third row correspond to the sequences of Therblig sub-actions and their temporal boundaries. Both solid and dashed arrows indicate the temporal extent of Therblig sub-actions. A dashed arrow indicates that the Therblig sub-action is extraneous, and a solid arrow indicates that the associated sub-action clip is used in training our system. We release our annotations publicly at https://drive.google.com/drive/folders/1-UUywelBCOe-E_ErpoaAHQa4dgjq6AfH?usp=sharing.
  • Figure 5: Human Hand Pose to Gripper Pose Re-targeting: On the left, we extract the MANO joint positions of the human hand. From these we derive a 6-DOF hand pose. We then align the robot gripper (on the right) to conform to this derived 6-DOF hand pose.
  • ...and 1 more figure
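
The Figure 1 caption describes supervision with an $L_1$ loss on the relative hand transform between $I_0$ and $I_T$. The sketch below illustrates one way such a target could be computed. It assumes each hand pose is available as a 4x4 homogeneous matrix in the camera frame and that the loss is taken over all matrix entries; the paper's actual pose parametrization is not stated in this section, and the function names `relative_transform` and `l1_loss` are hypothetical.

```python
import numpy as np


def relative_transform(T_start: np.ndarray, T_end: np.ndarray) -> np.ndarray:
    """Pose of the hand at I_T expressed in the hand frame at I_0.

    Both inputs are assumed to be 4x4 homogeneous transforms in a shared
    (camera) frame. This parametrization is an assumption for illustration.
    """
    return np.linalg.inv(T_start) @ T_end


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over all entries (a plain L1 loss)."""
    return float(np.mean(np.abs(pred - target)))


# Example: the hand keeps its orientation (90 degrees about z) and moves
# 2 units along the camera y-axis between the first and last frame.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T0 = np.eye(4)
T0[:3, :3] = R
T0[:3, 3] = [1.0, 0.0, 0.0]
T1 = np.eye(4)
T1[:3, :3] = R
T1[:3, 3] = [1.0, 2.0, 0.0]

target = relative_transform(T0, T1)      # supervision signal
prediction = np.eye(4)                   # stand-in for a model output
loss = l1_loss(prediction, target)
```

Expressing the target in the hand frame at $I_0$ makes it independent of where the hand started in the camera frame, which is one plausible reason to predict relative rather than absolute transforms.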
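
The Render and Blend step in the Figure 3 caption adjusts foreground and background contents based on depth differences between the scene and the robot. A minimal sketch of that idea is a per-pixel depth test, shown below. It assumes the rendered robot comes with its own depth map and a binary mask, and it uses a hard comparison rather than any soft blending the actual pipeline may apply; `depth_composite` and all argument names are hypothetical.

```python
import numpy as np


def depth_composite(scene_rgb, scene_depth, robot_rgb, robot_depth, robot_mask):
    """Paste robot pixels only where the robot is closer than the scene.

    scene_rgb, robot_rgb: (H, W, 3) arrays.
    scene_depth, robot_depth: (H, W) metric depth maps.
    robot_mask: (H, W) boolean mask of pixels covered by the rendered robot.
    Returns the composited image and the mask of visible robot pixels.
    """
    # A robot pixel is visible if it is rendered there and is in front of the scene.
    visible = robot_mask & (robot_depth < scene_depth)
    out = scene_rgb.copy()
    out[visible] = robot_rgb[visible]
    return out, visible


# Tiny 2x2 example: the top row of the scene is nearer than the robot
# (so it occludes the robot); the bottom row is farther away.
scene_rgb = np.zeros((2, 2, 3), dtype=np.uint8)
robot_rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
scene_depth = np.array([[1.0, 1.0],
                        [3.0, 3.0]])
robot_depth = np.full((2, 2), 2.0)
robot_mask = np.array([[True, True],
                       [True, False]])

composite, visible = depth_composite(
    scene_rgb, scene_depth, robot_rgb, robot_depth, robot_mask
)
```

In this example only the bottom-left pixel receives the robot colour: the top row is occluded by nearer scene content, and the bottom-right pixel lies outside the robot mask.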