EgoZero: Robot Learning from Smart Glasses

Vincent Liu, Ademi Adeniji, Haotian Zhan, Siddhant Haldar, Raunaq Bhirangi, Pieter Abbeel, Lerrel Pinto

TL;DR

EgoZero tackles the data bottleneck in real-world robotics by learning zero-shot manipulation policies from in-the-wild egocentric human demonstrations captured with Project Aria glasses, without any robot data. It unifies the human and robot domains through egocentric 3D point representations and trains a closed-loop Transformer policy via behavior cloning in this shared space, relying on triangulated object points and hand-pose cues. The approach achieves a 70% zero-shot success rate across seven tasks on a Franka Panda, with only 20 minutes of human data per task, and generalizes to new viewpoints, object poses, and object instances. This work suggests that scalable, diverse human data can serve as a practical foundation for real-world robot learning, paving the way for more human-centric and data-efficient robotics research.
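To make "triangulated object points" concrete, here is a minimal sketch of linear (DLT) triangulation of one point observed from several camera poses. It is not taken from the EgoZero code: the pinhole model, the world-to-camera pose convention, and the names `pose_from_center`, `project`, and `triangulate` are all illustrative assumptions, and the glasses' actual cameras would need their own calibration model.

```python
import numpy as np


def pose_from_center(center):
    """Hypothetical helper: 4x4 world-to-camera pose with identity rotation."""
    T = np.eye(4)
    T[:3, 3] = -np.asarray(center, dtype=float)
    return T


def project(K, T_cam_world, X):
    """Pinhole projection of a 3D world point X to pixel coordinates (u, v)."""
    X_cam = T_cam_world[:3, :3] @ X + T_cam_world[:3, 3]
    uvw = K @ X_cam
    return uvw[:2] / uvw[2]


def triangulate(K, poses, pixels):
    """Linear (DLT) triangulation of one point seen in several frames.

    K      : 3x3 intrinsics (assumed shared by all frames)
    poses  : list of 4x4 world-to-camera transforms along the trajectory
    pixels : list of (u, v) observations of the same point, one per pose
    """
    rows = []
    for T, (u, v) in zip(poses, pixels):
        P = K @ T[:3, :]              # 3x4 projection matrix for this frame
        rows.append(u * P[2] - P[0])  # each view contributes two linear
        rows.append(v * P[2] - P[1])  # constraints on the homogeneous point
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]
```

At least two frames with distinct camera centers are required; with noisy pixel tracks, more frames and a wider baseline generally give a more stable estimate.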

Abstract

Despite recent progress in general-purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans interact constantly with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, and zero robot data. EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a Franka Panda robot with a gripper and demonstrate zero-shot transfer with a 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning, paving the way toward a future of abundant, diverse, and naturalistic training data for robots. Code and videos are available at https://egozero-robot.github.io.

Paper Structure

This paper contains 20 sections, 7 equations, 13 figures, 1 table, 1 algorithm.

Figures (13)

  • Figure 1: EgoZero trains policies in a unified state-action space defined as egocentric 3D points. Unlike previous methods which leverage multi-camera calibration and depth sensors, EgoZero localizes object points via triangulation over the camera trajectory, and computes action points via Aria MPS hand pose and a hand estimation model. These points supervise a closed-loop Transformer policy, which is rolled out on unprojected points from an iPhone during inference.
  • Figure 2: Our 7 tasks. Top: open oven door, put bread on plate, sweep board with broom, erase board. Bottom: sort fruit, fold towel, and insert book in shelf. See the tasks appendix for full trajectories.
  • Figure 3: Distribution of bread keypoints for the "put bread on plate" task. The columns are projections of the 3D space onto each 2D plane. The policy generalizes to object poses far outside its training volume and begins to fail when the objects are near the limits of its augmented volume.
  • Figure 4: Object semantic generalization. Human demonstrations are done with only black ovens (top). The policy transfers zero-shot to the robot with the same oven (middle) and also generalizes to a new oven instance (bottom). The points are color-coded to represent the correspondence.
  • Figure 5: Open oven door.
  • ...and 8 more figures
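
The Figure 1 caption mentions rolling the policy out on unprojected points at inference time. As a rough illustration of what unprojection means, the sketch below back-projects a pixel with a metric depth value into a 3D point in the camera frame. It assumes an ideal pinhole camera and a per-pixel depth reading; the function name `unproject` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a camera-frame 3D point.

    Assumes an ideal pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. Depth is measured along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

A pixel at the principal point maps to a point on the optical axis, and pixels farther from it map to points proportionally farther off-axis at the same depth.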