Phantom: Training Robots Without Robots Using Only Human Videos

Marion Lepert, Jiaying Fang, Jeannette Bohg

TL;DR

The paper addresses the bottleneck of collecting robotic data by showing that diverse human video demonstrations can be transformed into robot-like training data through targeted data editing. It combines hand-pose-based action labeling with inpainting-based embodiment editing to create a robot observation-action dataset from human videos, enabling closed-loop imitation learning that can deploy zero-shot on real robots without robot data. The approach is demonstrated across multiple tasks and robots, with ablations showing the effectiveness of train-time and test-time data editing, the importance of high-quality inpainting, and the benefit of co-training with diverse human data. The work suggests a scalable path toward large-scale, robot-agnostic learning by leveraging readily available human demonstrations, potentially integrating with generalist, large-scale robotics policies.

Abstract

Scaling robotics data collection is critical to advancing general-purpose robots. Current approaches often rely on teleoperated demonstrations, which are difficult to scale. We propose a novel data collection method that eliminates the need for robotics hardware by leveraging human video demonstrations. By training imitation learning policies on this human data, our approach enables zero-shot deployment on robots without collecting any robot-specific data. To bridge the embodiment gap between human and robot appearances, we apply data editing to the input observations so that the image distributions of the human training data and the robot test data are aligned. Our method significantly reduces the cost of diverse data collection by allowing anyone with an RGBD camera to contribute. We demonstrate that our approach works in diverse, unseen environments and on varied tasks.

Paper Structure

This paper contains 29 sections, 1 equation, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of our data-editing pipeline for learning robot policies from human videos. During training, we first estimate the hand pose in each frame of a human video demonstration and convert it into a robot action. We then remove the human hand using inpainting and overlay a virtual robot in its place. The resulting augmented dataset is used to train an imitation learning policy, $\pi$. At test time, we overlay a virtual robot on real robot observations to ensure visual consistency, enabling direct deployment of the learned policy on a real robot.
  • Figure 2: Left: We use HaMeR to estimate the pose of the hand at each timestep. To refine the HaMeR-predicted mesh points $\hat{\mathbf{V}}_t$, shown in green, we use ICP registration to align them with the partial point cloud of the hand, $\mathbf{P}_t$, to obtain $\mathbf{V}_t$. Right: After aligning the HaMeR keypoints with the hand point cloud, we calculate the target position $\mathbf{p}_t$ as the midpoint between the tips of the thumb and index finger, and the target orientation by fitting a plane through the points of the thumb and index finger.
  • Figure 3: The data-editing strategies we compare for human-to-robot transfer. We evaluate three approaches: (1) Hand Inpaint, where the human hand is removed via inpainting and replaced with a rendered robot; (2) Hand Mask, where the human hand is blacked out during training and a rendered robot is overlaid on top, and at test time a black mask of a human arm is added to match the training distribution; and (3) Red Line, where the human arm is blacked out and replaced with a red line during training, and at test time the robot arm is blacked out and similarly overlaid with a red line. Both Hand Inpaint and Hand Mask achieve high success rates, but Hand Inpaint produces more realistic images and allows for faster rollouts.
  • Figure 4: The five tasks used to evaluate our method in an in-distribution scene on a Franka robot.
  • Figure 5: The out-of-distribution evaluation scenes used to evaluate the sweeping task on the Kinova robot.
  • ...and 5 more figures
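
The action-labeling step in the Figure 2 caption and the overlay step in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function names `grasp_target_from_keypoints` and `overlay_robot` are hypothetical, the keypoints are assumed to be already refined 3D points ordered from finger base to fingertip, and the plane is fitted by a plain least-squares SVD, which may differ from the fitting procedure used in the paper. The sketch returns only the plane normal; turning it into a full gripper orientation needs an additional axis convention that the captions do not specify.

```python
import numpy as np


def grasp_target_from_keypoints(thumb_pts, index_pts):
    """Sketch of the Figure 2 labeling step (hypothetical helper).

    thumb_pts, index_pts: (N, 3) arrays of 3D keypoints, assumed to be
    ordered from the base of the finger to the fingertip.
    Returns the target position and a unit normal of the fitted plane.
    """
    thumb_pts = np.asarray(thumb_pts, dtype=float)
    index_pts = np.asarray(index_pts, dtype=float)

    # Target position: midpoint between the thumb tip and the index tip.
    position = 0.5 * (thumb_pts[-1] + index_pts[-1])

    # Least-squares plane fit through all thumb and index points:
    # the normal is the direction of least variance of the centered points.
    pts = np.vstack([thumb_pts, index_pts])
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    normal = vt[-1]  # unit vector; its sign is ambiguous

    return position, normal


def overlay_robot(background, robot_rgb, robot_mask):
    """Sketch of the Figure 1 overlay step (hypothetical helper).

    background: (H, W, 3) image, e.g. an inpainted frame.
    robot_rgb:  (H, W, 3) rendering of the virtual robot.
    robot_mask: (H, W) array, nonzero where the rendered robot is visible.
    """
    mask = np.asarray(robot_mask).astype(bool)[..., None]
    # Take robot pixels where the mask is set, background pixels elsewhere.
    return np.where(mask, robot_rgb, background)


if __name__ == "__main__":
    # Toy keypoints lying in the z = 0 plane.
    thumb = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    index = [[0.0, 1.0, 0.0], [1.0, 1.5, 0.0], [2.0, 2.0, 0.0]]
    p, n = grasp_target_from_keypoints(thumb, index)
    print("target position:", p)
    print("plane normal (up to sign):", n)
```

Because the sign of the normal returned by the SVD is arbitrary, a real pipeline would need to fix it with some consistent rule (for example, pointing away from the palm) before converting it into a robot action.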