MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

Alexey Gavryushin, Xi Wang, Robert J. S. Malate, Chenyu Yang, Davide Liconti, René Zurbrügg, Robert K. Katzschmann, Marc Pollefeys

TL;DR

MAPLE addresses dexterous robotic manipulation by learning manipulation priors from large-scale egocentric videos. It trains a visual encoder to predict fine-grained hand-object contact points and 3D hand poses at the moment of contact, producing representations that improve downstream visuomotor policies. The approach is validated on eight dexterous simulation tasks and in real-world deployment. It combines an automated data extraction pipeline, hand pose tokenization, and a diffusion-based policy to achieve strong generalization and robustness, including zero-shot generalization to real-world variations. The work also introduces new dexterous simulation tasks and provides a pathway for public release of code and data to spur future research.

Abstract

Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise control are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional data-driven approaches to robotic manipulation. To address this gap, we leverage manipulation priors learned from large-scale egocentric video datasets to improve policy learning for dexterous robotic manipulation tasks. We present MAPLE, a novel method for dexterous robotic manipulation that learns features to predict object contact points and detailed hand poses at the moment of contact from egocentric images. We then use the learned features to train policies for downstream manipulation tasks. Experimental results demonstrate the effectiveness of MAPLE across 4 existing simulation benchmarks, as well as a newly designed set of 4 challenging simulation tasks requiring fine-grained object control and complex dexterous skills. The benefits of MAPLE are further highlighted in real-world experiments using a 17-DoF dexterous robotic hand; joint evaluation in both simulation and the real world has remained underexplored in prior work. We additionally showcase the efficacy of our model on an egocentric contact point prediction task, validating its usefulness beyond dexterous manipulation policy learning.

Paper Structure

This paper contains 35 sections, 8 equations, 20 figures, and 4 tables.

Figures (20)

  • Figure 1: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos. We present MAPLE, a framework that learns dexterous manipulation priors from egocentric videos and produces features well-suited for downstream dexterous robotic manipulation tasks. Experiments in both simulation and real-world settings demonstrate that MAPLE enables efficient policy learning and improves generalization across various tasks.
  • Figure 2: Overview of MAPLE. Given a single input frame, the encoder is trained to reason about hand-object interactions, specifically predicting contact points and grasping hand poses. This training infuses a manipulation prior into the learned feature representation, making it well-suited for downstream robotic manipulation. Features extracted from the frozen visual encoder, combined with robotic hand positions, are fed into a Transformer-based diffusion policy network to predict dexterous hand action sequences.
  • Figure 3: Manipulation Prior Modeling. We learn manipulation priors by predicting future contact points and hand poses from an input frame. Training data are extracted by identifying a contact frame $f_c$ and a preceding prediction frame $f_p$, which is used as input. Contact points and hand poses are extracted from the contact frame, and a point tracker is used to back-project the contact locations onto the prediction frame.
  • Figure 4: Simulated Evaluation Environments. We evaluate our method on four environments from DAPG (a-d) and propose four new robotic environments (e-h). Our new environments evaluate the manipulation of objects commonly used by humans, namely a pan, a brush, a drill, and a clothes iron.
  • Figure 5: Real World Sequences of the Evaluated Tasks. Example rollouts of MAPLE. From top to bottom: 'Open the bottle cap' (bottle), 'Place the pan' (pan) and 'Wash the dish' (sponge) tasks using the ORCA hand and the Franka Emika Panda manipulator.
  • ...and 15 more figures
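
The data extraction described in the Figure 3 caption can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the names `TrainingSample`, `first_contact_frame`, `extract_sample`, and `toy_track_back` are hypothetical, the per-frame contact flags stand in for whatever hand-object contact detector is actually used, the fixed `lookback` offset between $f_c$ and $f_p$ is an assumed choice, and the toy tracker replaces a real point tracker with a constant horizontal drift.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
# Hypothetical tracker interface: (point, source_frame, target_frame) -> point
TrackBack = Callable[[Point, int, int], Point]


@dataclass
class TrainingSample:
    prediction_frame: int        # index of f_p, the frame given to the encoder
    contact_frame: int           # index of f_c, where the labels are read off
    contact_points: List[Point]  # contact locations expressed in f_p pixels
    hand_pose: Sequence[float]   # hand pose observed at f_c


def first_contact_frame(in_contact: Sequence[bool]) -> Optional[int]:
    """Return the first index where contact begins (False -> True), else None."""
    for i in range(1, len(in_contact)):
        if in_contact[i] and not in_contact[i - 1]:
            return i
    return None


def extract_sample(
    in_contact: Sequence[bool],
    contact_points_fc: Sequence[Point],
    hand_pose_fc: Sequence[float],
    track_back: TrackBack,
    lookback: int = 10,  # assumed gap between f_p and f_c, in frames
) -> Optional[TrainingSample]:
    """Build one (f_p, labels) pair from a clip, or None if no contact starts."""
    f_c = first_contact_frame(in_contact)
    if f_c is None:
        return None
    f_p = max(0, f_c - lookback)  # prediction frame precedes the contact frame
    # Back-project each contact location from f_c onto f_p via the tracker.
    points_fp = [track_back(p, f_c, f_p) for p in contact_points_fc]
    return TrainingSample(f_p, f_c, points_fp, hand_pose_fc)


def toy_track_back(p: Point, src: int, dst: int) -> Point:
    """Stand-in tracker: assumes the scene drifts 1.5 px right per frame."""
    return (p[0] - 1.5 * (src - dst), p[1])


if __name__ == "__main__":
    flags = [False] * 5 + [True] * 3  # contact begins at frame 5
    sample = extract_sample(
        flags, [(100.0, 50.0)], [0.0, 0.0, 0.0], toy_track_back, lookback=3
    )
    print(sample)
```

In this sketch the encoder would receive the image at `prediction_frame` and be supervised with `contact_points` and `hand_pose`, so the labels describe an interaction that has not yet happened in the input frame.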