Phantom: Training Robots Without Robots Using Only Human Videos
Marion Lepert, Jiaying Fang, Jeannette Bohg
TL;DR
The paper addresses the bottleneck of collecting robot data by showing that diverse human video demonstrations can be transformed into robot-like training data through targeted data editing. It combines hand-pose-based action labeling with inpainting-based embodiment editing to create a robot observation-action dataset from human videos, which is used to train closed-loop imitation learning policies that deploy zero-shot on real robots without any robot data. The approach is demonstrated across multiple tasks and robots, with ablations showing the effectiveness of train-time and test-time data editing, the importance of high-quality inpainting, and the benefit of co-training with diverse human data. The work suggests a scalable path toward large-scale, robot-agnostic learning from readily available human demonstrations, with potential to integrate with generalist, large-scale robot policies.
Abstract
Scaling robotics data collection is critical to advancing general-purpose robots. Current approaches often rely on teleoperated demonstrations which are difficult to scale. We propose a novel data collection method that eliminates the need for robotics hardware by leveraging human video demonstrations. By training imitation learning policies on this human data, our approach enables zero-shot deployment on robots without collecting any robot-specific data. To bridge the embodiment gap between human and robot appearances, we utilize a data editing approach on the input observations that aligns the image distributions between training data on humans and test data on robots. Our method significantly reduces the cost of diverse data collection by allowing anyone with an RGBD camera to contribute. We demonstrate that our approach works in diverse, unseen environments and on varied tasks.
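To make the two editing steps named above more concrete, here is a minimal Python sketch of hand-pose-based action labeling and embodiment editing on a single frame. It is an illustration under stated assumptions, not the authors' implementation. The function names, the two-fingertip parameterization of the gripper, the `max_opening` value, the use of a known background image as a stand-in for a learned inpainting model, and the overlay of a rendered robot are all hypothetical choices made for the example.

```python
import math


def hand_to_gripper_action(thumb_tip, index_tip, max_opening=0.08):
    """Map two 3D fingertip positions (metres) to a gripper target.

    Assumption: the gripper position is the fingertip midpoint and the
    gripper opening is the fingertip distance, clamped to max_opening.
    """
    position = tuple((t + i) / 2.0 for t, i in zip(thumb_tip, index_tip))
    opening = min(math.dist(thumb_tip, index_tip), max_opening)
    return position, opening


def edit_embodiment(frame, hand_mask, background, robot_render, robot_mask):
    """Remove the human hand from a frame and overlay a rendered robot.

    Images are nested lists of pixels; masks are nested lists of booleans.
    Assumption: `background` stands in for the output of an inpainting
    model, which a real pipeline would have to compute.
    """
    edited = []
    for r, row in enumerate(frame):
        new_row = []
        for c, pixel in enumerate(row):
            if robot_mask[r][c]:
                pixel = robot_render[r][c]   # robot pixels take priority
            elif hand_mask[r][c]:
                pixel = background[r][c]     # fill in where the hand was
            new_row.append(pixel)
        edited.append(new_row)
    return edited


# One toy frame: "h" = hand pixel, "s" = scene pixel.
frame = [["h", "s"], ["s", "s"]]
hand_mask = [[True, False], [False, False]]
background = [["b", "b"], ["b", "b"]]
robot_render = [["x", "r"], ["x", "x"]]
robot_mask = [[False, True], [False, False]]

observation = edit_embodiment(frame, hand_mask, background, robot_render, robot_mask)
action = hand_to_gripper_action((0.0, 0.0, 0.0), (0.04, 0.0, 0.0))
```

Each edited frame paired with its action label forms one observation-action sample. Under the same assumptions, applying the robot overlay to the real robot's camera images at test time is one way the train and test image distributions could be aligned.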
