
Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin, Krishna Bodduluri, Mike Lambeta, Lerrel Pinto, Akshara Rai, Jitendra Malik, Tingfan Wu, Akash Sharma, Homanga Bharadhwaj

TL;DR

Aina is a framework for learning dexterous, multi-fingered robot manipulation policies from in-the-wild human demonstrations collected with Aria Gen 2 smart glasses, without any robot data or simulation. The method triangulates 3D hand keypoints and object point clouds from human videos, aligns them to a robot reference frame using a single in-scene demonstration, and trains a transformer-based 3D policy that predicts future fingertip trajectories. At deployment, an inverse-kinematics module maps these trajectories to robot joint commands. Evaluations across nine everyday tasks show that Aina outperforms image-based baselines, generalizes across spatial configurations and to some new objects, and remains robust to background changes. The authors note limitations in force feedback and in depth alignment between the wearable device and the deployment system, which point to future work on additional sensing and hardware integration.

Abstract

Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
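
To make the geometry behind "stereo view for depth estimation" and "3D point-based policies" concrete, here is a minimal sketch of how a matched pixel in a rectified stereo pair can be lifted to a 3D point and then expressed in another reference frame. It is illustrative only: the abstract does not say how AINA estimates depth or aligns frames, and the function names, the pinhole model with a single focal length, and all numeric values below are assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth along the optical axis for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to yield a finite depth")
    return focal_px * baseline_m / disparity_px


def backproject(u: float, v: float, depth_m: float,
                focal_px: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with known depth to a camera-frame 3D point (pinhole model)."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


def apply_rigid_transform(T: Sequence[Sequence[float]], p: Point3) -> Point3:
    """Apply a 4x4 homogeneous transform T (rotation + translation) to point p."""
    x, y, z = p
    out = [T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3)]
    return (out[0], out[1], out[2])


# Hypothetical camera parameters and measurement (assumed values, not from the paper).
FOCAL_PX, CX, CY, BASELINE_M = 500.0, 320.0, 240.0, 0.1

depth = disparity_to_depth(disparity_px=25.0, focal_px=FOCAL_PX, baseline_m=BASELINE_M)
point_cam = backproject(420.0, 240.0, depth, FOCAL_PX, CX, CY)

# Hypothetical camera-to-robot transform: 90 degree rotation about z, then +1 m along x.
T_ROBOT_FROM_CAM = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
]
point_robot = apply_rigid_transform(T_ROBOT_FROM_CAM, point_cam)
```

Applying the same transform to every hand keypoint and object point would place a whole demonstration in the robot's frame, which is the kind of representation a point-based policy consumes.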

Paper Structure

This paper contains 36 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Aina is a framework for learning multi-fingered policies from in-the-wild human data collected with smart glasses, without requiring any robot data (including online corrections or simulation). The workflow is as follows: a human wears the Aria Gen 2 glasses and collects in-the-wild demonstrations on any surface with arbitrary backgrounds (left), then records a single demonstration in the robot deployment space (middle), after which point-based policies are trained and directly deployed on the robot (right). With an average of just 15 minutes of human video collection effort, Aina is able to train autonomous robot policies.
  • Figure 2: Comparison of Aina's capabilities with some prior human-to-robot learning frameworks. In-The-Wild indicates whether data can be easily collected in natural settings outside the lab. Sensors describes the sensory outputs available from the data collection devices. Learning Extractions specifies which extractions can be utilized with the provided sensors to improve learning. Data Embodiment refers to the embodiment of the collected data (robot vs. human). Here, we also count online corrections [wang2024dexcap] and reinforcement learning [hudor] performed on the robot as part of the robot data. Robot Embodiment indicates which type of robot embodiment the framework targets (two-fingered gripper vs. multi-fingered hand). In Aina, we choose point-based approaches for their robustness to background variations, enabling robot learning from in-the-wild data for dexterous hands. This is made possible by the advanced sensing capabilities of the Aria Gen 2 glasses, which provide all the necessary 3D extractions.
  • Figure 3: Illustration of our overall Aina framework. On the left, we show how the data is processed: the human hand pose is extracted directly by the Aria Gen 2 glasses, and stereo depth is estimated from the surrounding SLAM camera frames. This enables the 3D policy learning methods on the right to succeed while remaining robust to background clutter.
  • Figure 4: Illustration of our robot setup.
  • Figure 5: Robot rollouts of Aina across nine tasks. Spatial generalization is shown in the leftmost column for each task. The meaning of each symbol is explained below the figure. Dotted lines indicate the object's orientation; when not shown, the orientation remains the same as in the showcased rollout. For the Oven Opening task, we showcase Aina's performance when there is background disturbance.
  • ...and 4 more figures