Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans
Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, Shubham Tulsiani
TL;DR
This work targets generalist robots capable of zero-shot manipulation of unseen objects by factorizing policy learning into a human-plan predictor and a robot-action translator. It leverages large-scale passive human videos to learn plausible hand-object interaction plans via diffusion, and uses a lightweight translator trained on limited in-domain data to map those plans to robot actions. Across 100 real-world tasks involving 16 skills and 40 objects, the approach generalizes in both table-top and in-the-wild settings with no deployment-time training. By combining web-scale human data with a small amount of in-domain robot data, the method addresses data bottlenecks in robot learning and highlights a practical path toward broadly capable manipulation systems.
Abstract
We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills, and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches, which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), and then translate this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy and allows the robot to follow human plans for generic manipulation tasks in a zero-shot manner, with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module requires only a small amount of in-domain data and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/
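To make the factorization concrete, the following is a minimal Python sketch of how a plan predictor and a translation module might compose at deployment time. It is not the authors' implementation: every name here (`HumanPlan`, `toy_plan_predictor`, `toy_translator`, `zero_shot_rollout`) is hypothetical, scene "images" are reduced to short feature vectors, the predictor is a linear-interpolation stand-in for the learned model, and the translator is a simple waypoint-difference stand-in for the learned plan-conditioned policy.

```python
from dataclasses import dataclass
from typing import Callable, List

Vec = List[float]  # assumption: a scene "image" is reduced to a small feature vector


@dataclass
class HumanPlan:
    """Hypothetical container for predicted future hand and object configurations."""
    hand_waypoints: List[Vec]
    object_waypoints: List[Vec]


def _interpolate(start: Vec, goal: Vec, steps: int) -> List[Vec]:
    """Return `steps` evenly spaced points from just after `start` up to `goal`."""
    return [
        [s + (g - s) * (k / steps) for s, g in zip(start, goal)]
        for k in range(1, steps + 1)
    ]


def toy_plan_predictor(current: Vec, goal: Vec, horizon: int = 4) -> HumanPlan:
    """Stand-in for the learned human plan predictor (current + goal -> plan)."""
    waypoints = _interpolate(current, goal, horizon)
    # Toy assumption: the hand and the object follow the same path.
    return HumanPlan(hand_waypoints=waypoints, object_waypoints=waypoints)


def toy_translator(current: Vec, plan: HumanPlan) -> List[Vec]:
    """Stand-in for the plan-conditioned robot policy (plan -> robot actions).

    Each action is the displacement between consecutive hand waypoints.
    """
    actions, previous = [], current
    for waypoint in plan.hand_waypoints:
        actions.append([w - p for w, p in zip(waypoint, previous)])
        previous = waypoint
    return actions


def zero_shot_rollout(
    current: Vec,
    goal: Vec,
    predictor: Callable[[Vec, Vec], HumanPlan],
    translator: Callable[[Vec, HumanPlan], List[Vec]],
) -> List[Vec]:
    """Compose the two stages: predict a human plan, then translate it to actions."""
    plan = predictor(current, goal)
    return translator(current, plan)


actions = zero_shot_rollout([0.0, 0.0], [1.0, 2.0], toy_plan_predictor, toy_translator)
```

The point of the sketch is the interface: the predictor sees only the current and goal observations, and the translator sees only the current observation and the plan, so the two stages can in principle be trained on different data sources, as the abstract describes.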
