Hand-Object Interaction Pretraining from Videos
Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik
TL;DR
Hand-Object Interaction Pretraining (HOP) leverages in-the-wild hand-object videos by lifting interactions to 3D, retargeting them into a physics-based simulator to generate a large sensorimotor dataset, and training a transformer-based base policy to predict next actions. This base policy is finetuned with RL or behavior cloning for downstream tasks, enabling sample-efficient adaptation and improved robustness. Real-world and simulation experiments demonstrate that HOP can generalize beyond the pretraining data and provide practical transfer benefits, though it shows some limitations on the hardest real-world task compared to strong image-based baselines. Overall, HOP offers a scalable approach to transfer human manipulation knowledge to dexterous robots using minimal task-specific demonstrations.
Abstract
We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework that uses in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object into a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy that captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/
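
As a rough illustration of the next-action prediction objective mentioned in the TL;DR, the sketch below shows one way a retargeted robot trajectory could be sliced into training examples. The function name `next_action_pairs`, the `context_len` parameter, and the (observation, action) step layout are assumptions made for illustration only; they are not taken from the HOP codebase, and the actual data format and model inputs may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical step layout: one (observation, action) pair per timestep.
Step = Tuple[Sequence[float], Sequence[float]]
Example = Tuple[List[Step], Sequence[float], Sequence[float]]


def next_action_pairs(trajectory: List[Step], context_len: int) -> List[Example]:
    """Slice one trajectory into (history, current_obs, target_action) examples.

    history       : the previous `context_len` (observation, action) steps
    current_obs   : the observation at the step being predicted
    target_action : the action at that step, used as the prediction target
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    examples: List[Example] = []
    for t in range(context_len, len(trajectory)):
        history = trajectory[t - context_len:t]
        current_obs, target_action = trajectory[t]
        examples.append((history, current_obs, target_action))
    return examples


# Toy trajectory with 1-D observations and actions (made-up values).
toy_trajectory: List[Step] = [([float(t)], [float(t) * 10.0]) for t in range(5)]
toy_examples = next_action_pairs(toy_trajectory, context_len=2)
print(len(toy_examples))
```

A sequence model would then be trained to map each history and current observation to its target action; trajectories shorter than `context_len + 1` steps yield no examples under this slicing.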
