EgoMimic: Scaling Imitation Learning via Egocentric Video
Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, Danfei Xu
TL;DR
EgoMimic addresses the data bottleneck in imitation learning by combining passively collected egocentric human embodiment data, captured with Project Aria glasses, with demonstrations collected on a low-cost bimanual robot. It unifies human and robot data through hardware alignment, cross-embodiment data processing, and a shared vision-and-transformer policy that outputs both pose predictions and joint-space actions. Across three real-world, long-horizon tasks, EgoMimic achieves substantial in-domain performance gains, generalizes to unseen objects and scenes, and shows a favorable scaling trend: one additional hour of hand data improves performance more than one additional hour of robot data. The work points toward a scalable path to Internet-scale robot learning by treating human data as a first-class embodiment rather than a peripheral source.
Abstract
The scale and diversity of demonstration data required for imitation learning pose a significant challenge. We present EgoMimic, a full-stack framework that scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvements over state-of-the-art imitation learning methods on a diverse set of long-horizon, single-arm and bimanual manipulation tasks, and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of hand data is significantly more valuable than adding 1 hour of robot data. Videos and additional information can be found at https://egomimic.github.io/
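To make the co-training idea in item (4) concrete, the sketch below shows one way a unified objective over mixed human and robot batches could be written. It is an illustration under stated assumptions, not the authors' implementation: the names `Sample`, `l1`, `cotrain_loss`, and `joint_weight` are hypothetical, the L1 loss is an assumed choice, and the predictions are plain lists standing in for the outputs of a shared policy. The only idea taken from the text is that both embodiments supervise a pose output, while joint-space actions are supervised by robot data alone.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """One training example from either embodiment (hypothetical layout)."""
    embodiment: str                         # "human" or "robot"
    pred_pose: Sequence[float]              # pose output of the shared policy
    true_pose: Sequence[float]              # hand pose (human) or end-effector pose (robot)
    pred_joints: Optional[Sequence[float]] = None   # joint-space output, robot only
    true_joints: Optional[Sequence[float]] = None


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length vectors (assumed loss)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cotrain_loss(batch: Sequence[Sample], joint_weight: float = 1.0) -> float:
    """Pose loss over every sample plus joint loss over robot samples only."""
    if not batch:
        raise ValueError("batch must not be empty")

    # Both embodiments contribute to the pose term.
    pose_terms = [l1(s.pred_pose, s.true_pose) for s in batch]

    # Only robot samples carry joint-space labels.
    joint_terms = []
    for s in batch:
        if s.embodiment != "robot":
            continue
        if s.pred_joints is None or s.true_joints is None:
            raise ValueError("robot samples need joint predictions and labels")
        joint_terms.append(l1(s.pred_joints, s.true_joints))

    pose_loss = sum(pose_terms) / len(pose_terms)
    joint_loss = sum(joint_terms) / len(joint_terms) if joint_terms else 0.0
    return pose_loss + joint_weight * joint_loss


if __name__ == "__main__":
    mixed = [
        Sample("human", pred_pose=[0.0, 1.0], true_pose=[0.0, 0.0]),
        Sample("robot", pred_pose=[1.0, 1.0], true_pose=[1.0, 1.0],
               pred_joints=[0.5], true_joints=[0.0]),
    ]
    print(cotrain_loss(mixed))
```

In this sketch a batch containing only human samples still yields a training signal through the pose term, which is the sense in which human data is treated as demonstration data rather than as a source of high-level intent.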
