Hand-Object Interaction Pretraining from Videos

Himanshu Gaurav Singh; Antonio Loquercio; Carmelo Sferrazza; Jane Wu; Haozhi Qi; Pieter Abbeel; Jitendra Malik

TL;DR

Hand-Object Interaction Pretraining (HOP) leverages in-the-wild hand-object videos by lifting interactions to 3D, retargeting them into a physics-based simulator to generate a large sensorimotor dataset, and training a transformer-based base policy to predict next actions. This base policy is finetuned with RL or behavior cloning for downstream tasks, enabling sample-efficient adaptation and improved robustness. Real-world and simulation experiments demonstrate that HOP can generalize beyond the pretraining data and provide practical transfer benefits, though it shows some limitations on the hardest real-world task compared to strong image-based baselines. Overall, HOP offers a scalable approach to transfer human manipulation knowledge to dexterous robots using minimal task-specific demonstrations.

Abstract

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object into a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/

Paper Structure (18 sections, 2 equations, 6 figures, 1 table)

Introduction
Overview
Method
Lifting Hand-Object Interaction Videos to 3D
Mapping 3D Human-Object Interactions to Robot-Object Interactions
Robot Trajectory Pretraining
Experimental Setup
Experimental Results
Comparison to visual pre-training baselines (real-world)
Comparison to demonstration-guided reinforcement learning strategies (simulation)
Comparison to learning a hand-only motion prior (simulation)
Related Work
Conclusion and Limitations
Supplementary Material
3D Hand-Object Interaction from Videos
...and 3 more sections

Figures (6)

Figure 1: Real-world rollouts of the policy finetuned from HOP using fewer than 50 demonstrations. HOP enables sample-efficient downstream adaptation by learning a general manipulation prior from human videos.
Figure 2: 3D hand-object trajectories from in-the-wild human manipulation videos are retargeted to a robot embodiment within a physics simulator, resulting in physically grounded robot data. General manipulation priors are learnt from this data using generative modelling of trajectories. Such a representation enables sample-efficient adaptation to new downstream tasks.
Figure 3: Comparison of HOP-initialized actor with baselines. HOP improves sample-efficiency of online RL across multiple tasks, particularly when the downstream task and the behaviors in the data are less aligned, as in Lift & Throw. Runs are averaged across three randomly chosen seeds.
Figure 4: Evaluating RL finetuning under out-of-distribution scenarios. (Left) To test grasp robustness in the Grasp & Lift task, we apply to each grasped object a force in a random direction with magnitude equal to its weight. When initialized with HOP, the resulting policy is more than $3\times$ more robust than PPO trained from scratch. (Right) We evaluate grasp success on multiple objects from the YCB dataset that were not part of the training set. When initialized with HOP, the resulting policy is more than $2\times$ more robust than PPO trained from scratch.
Figure 5: Online exploration around the learnt prior from humans leads to grasps with more human-like and stable affordances compared to training PPO from scratch.
...and 1 more figures
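
The grasp-robustness test described in the Figure 4 caption (a force in a random direction, equal in magnitude to the object's weight) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the gravity constant, the uniform-on-sphere sampling, and the 0.5 kg example mass are all assumptions made for illustration, and how the force would be applied inside a particular simulator is left out.

```python
import math
import random

GRAVITY = 9.81  # m/s^2; assumed standard gravity, not stated in the source


def random_unit_vector(rng):
    """Sample a direction uniformly on the unit sphere (normalized Gaussians)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a degenerate near-zero sample
            return [c / norm for c in v]


def perturbation_force(mass_kg, rng):
    """Return a 3D force whose magnitude equals the object's weight (m * g)."""
    weight = mass_kg * GRAVITY
    return [weight * c for c in random_unit_vector(rng)]


# Hypothetical usage: a 0.5 kg object, seeded for reproducibility.
rng = random.Random(0)
force = perturbation_force(0.5, rng)
magnitude = math.sqrt(sum(c * c for c in force))
```

Here `magnitude` matches the object's weight up to floating-point error, while the direction changes with each draw from `rng`.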
