EgoMimic: Scaling Imitation Learning via Egocentric Video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, Danfei Xu

TL;DR

EgoMimic addresses the data bottleneck in imitation learning by leveraging passive, egocentric human embodiment data captured with Project Aria glasses alongside low-cost robotic demonstrations. It unifies human and robot data through hardware alignment, cross-embodiment data processing, and a shared vision-and-transformer policy that outputs pose predictions for both embodiments and joint-space actions for the robot. Across three real-world, long-horizon tasks, EgoMimic achieves substantial in-domain performance gains, generalizes to unseen objects and scenes, and scales favorably with data: an additional hour of hand data is worth more than an additional hour of robot data. The work offers a scalable pathway to Internet-scale robot learning by treating human data as a first-class embodiment rather than a peripheral source.

Abstract

The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/

Paper Structure

This paper contains 17 sections, 6 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: EgoMimic unlocks human embodiment data (egocentric videos paired with 3D hand tracks) as a new scalable data source for imitation learning. We can capture this data anywhere, without a robot, by wearing a pair of Project Aria glasses while performing manipulation tasks with our own hands. EgoMimic bridges kinematic, distributional, and appearance differences between human embodiment data (left) and traditional robot teleoperation data (right) to learn a unified policy. We find that human embodiment data boosts task performance by 34-228% over using robot data alone, and enables generalization to new objects or even scenes.
  • Figure 2: Our human data system uses Aria glasses to capture egocentric RGB and uses the glasses' side SLAM cameras to localize the device and track hands. The robot consists of two Viper X follower arms with Intel RealSense D405 wrist cameras, controlled by two WidowX leader arms. Our robot uses identical Aria glasses as the main vision sensor to help minimize the camera-to-camera gap.
  • Figure 3: a) Action normalization: The pose distributions differ between hand and robot data, specifically in the $y$ (left-right) dimension. We apply Gaussian normalization individually to the hand and robot pose data before feeding them to the model. b) Visual masking: To help bridge the appearance gap between the human hand and the robot arm, we apply a black mask to the hand and robot via SAM, then overlay a red line onto the image.
  • Figure 4: Architecture of the joint human-robot policy learning framework. The model processes normalized hand and robot data through shared vision and ACT encoders, outputting pose predictions for both human and robot data, and joint actions for robot data. The framework uses masked images to mitigate human-robot appearance gaps and incorporates wrist camera views for the robot.
  • Figure 5: We evaluate EgoMimic across three real-world, long-horizon manipulation tasks. See the experimental setup section for task descriptions.
  • ...and 6 more figures
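
The two alignment steps described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_gaussian_stats`, `normalize`, `unnormalize`, `mask_and_mark`), the synthetic pose data, and the line endpoints are all hypothetical. The segmentation mask is assumed to be already available (the paper obtains it via SAM, which is not called here).

```python
import numpy as np

def fit_gaussian_stats(poses, eps=1e-8):
    """Per-dimension mean and std of an (N, D) pose array for one embodiment."""
    poses = np.asarray(poses, dtype=np.float64)
    return poses.mean(axis=0), poses.std(axis=0) + eps

def normalize(poses, stats):
    """Map poses to zero mean and unit variance using that embodiment's stats."""
    mean, std = stats
    return (np.asarray(poses, dtype=np.float64) - mean) / std

def unnormalize(poses, stats):
    """Invert normalize(), e.g. to turn a prediction back into metric poses."""
    mean, std = stats
    return np.asarray(poses, dtype=np.float64) * std + mean

def mask_and_mark(image, arm_mask, line_start, line_end, n_samples=200):
    """Black out masked pixels, then draw a 1-pixel red line.

    image: (H, W, 3) uint8 RGB; arm_mask: (H, W) bool, assumed precomputed.
    line_start / line_end: hypothetical (row, col) endpoints of the overlay.
    """
    out = image.copy()
    out[arm_mask] = 0
    rows = np.linspace(line_start[0], line_end[0], n_samples).round().astype(int)
    cols = np.linspace(line_start[1], line_end[1], n_samples).round().astype(int)
    rows = np.clip(rows, 0, out.shape[0] - 1)
    cols = np.clip(cols, 0, out.shape[1] - 1)
    out[rows, cols] = (255, 0, 0)
    return out

# Synthetic (x, y, z) poses whose y distributions differ by embodiment.
rng = np.random.default_rng(0)
hand_poses = rng.normal([0.30, 0.20, 0.10], [0.05, 0.10, 0.03], size=(1000, 3))
robot_poses = rng.normal([0.30, -0.10, 0.10], [0.05, 0.04, 0.03], size=(1000, 3))

# Statistics are fit separately per embodiment, as the caption describes.
hand_stats = fit_gaussian_stats(hand_poses)
robot_stats = fit_gaussian_stats(robot_poses)
hand_norm = normalize(hand_poses, hand_stats)
robot_norm = normalize(robot_poses, robot_stats)

# Toy image with a rectangular stand-in for a segmentation mask.
frame = np.full((64, 64, 3), 180, dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:35] = True
masked_frame = mask_and_mark(frame, mask, (20, 30), (39, 30))
```

Fitting statistics per embodiment places both pose distributions on a common scale before they reach the shared policy. Predictions for a given embodiment would then be mapped back with that embodiment's own statistics through `unnormalize`.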