Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, Shubham Tulsiani

TL;DR

This work targets generalist robots capable of zero-shot manipulation of unseen objects by factorizing policy learning into a human-plan predictor and a robot-action translator. It leverages large-scale passive human videos to learn plausible hand-object interaction plans via diffusion, and uses a lightweight translator trained on limited in-domain data to map those plans to robot actions. Across 100 real-world tasks involving 16 skills and 40 objects, the approach generalizes in both table-top and in-the-wild settings with no deployment-time training. By combining web-scale human data with a small amount of in-domain robot data, the method addresses data bottlenecks in robot learning and highlights a practical path toward broadly capable manipulation systems.

Abstract

We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches that directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy and allows following human plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module requires only a small amount of in-domain data and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/
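
The factorization described in the abstract can be pictured as a small control loop: predict a human plan once from the current and goal images, then run a closed-loop policy conditioned on that plan. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `follow_plan`, `toy_predictor`, `toy_policy`, and `toy_observe` are hypothetical, images and masks are stand-in nested lists, and the 7-dimensional action and plan length of 3 are arbitrary choices.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]   # stand-in for an RGB frame (assumption)
Mask = List[List[int]]      # stand-in for a hand/object mask (assumption)
Action = List[float]        # stand-in for a robot command (assumption)


def follow_plan(
    x0: Image,
    xg: Image,
    predict_plan: Callable[[Image, Image], List[Mask]],
    policy: Callable[[List[Mask], Image], Action],
    observe: Callable[[Action], Image],
    horizon: int,
) -> Tuple[List[Mask], List[Action]]:
    """Predict a human plan once, then run a plan-conditioned closed-loop policy."""
    plan = predict_plan(x0, xg)      # future hand-object masks
    actions: List[Action] = []
    x_t = x0
    for _ in range(horizon):
        a_t = policy(plan, x_t)      # action from the plan and the current frame
        actions.append(a_t)
        x_t = observe(a_t)           # next observation after acting
    return plan, actions


# Toy stand-ins so the sketch runs end to end; none of these are real models.
def toy_predictor(x0: Image, xg: Image, k: int = 3) -> List[Mask]:
    h, w = len(x0), len(x0[0])
    return [[[0] * w for _ in range(h)] for _ in range(k)]


def toy_policy(plan: List[Mask], x_t: Image) -> Action:
    return [0.0] * 7                 # hypothetical 7-dimensional command


def toy_observe(a_t: Action) -> Image:
    return [[0.0, 0.0], [0.0, 0.0]]


frame = [[0.0, 0.0], [0.0, 0.0]]
plan, actions = follow_plan(frame, frame, toy_predictor, toy_policy, toy_observe, horizon=5)
print(len(plan), len(actions))       # prints "3 5"
```

In this framing the plan predictor never sees robot data and the policy never sees web videos, which is what would let the two parts be trained on different data sources.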

Paper Structure (21 sections, 2 equations, 11 figures)

Figures (11)

  • Figure 1: A subset of the manipulation behaviors generated by our framework, HOPMan. By learning task-agnostic human-plan prediction and robot-action translation models, our system can interact with generic objects and execute diverse skills, e.g., unrolling, scooping, pouring, re-orientation, and articulated object manipulation. Videos are on the supplementary website: https://homangab.github.io/hopman/
  • Figure 2: HOPMan consists of a human-interaction-plan prediction model (left) and a robot-action translation model (right). Given an initial image of a scene $\mathbf{X_0}$ and a goal image $\mathbf{X_g}$, a diffusion model hallucinates plausible future hand and object masks $M_{1:K}$. These predictions, along with current RGB observations of the scene $\mathbf{X}_t$, are passed as input to a translation model (instantiated as a closed-loop policy $\pi(\cdot)$) that outputs robot actions $a_t$ for executing the motions on a robot. Additional details on the approach are in Section \ref{sec:overall_framework}.
  • Figure 3: Detailed illustration of a training pass through the future prediction model. This is a diffusion model, with a U-Net that predicts per-frame noise at each step $p$ of the diffusion process. Additional details on the model and training are in Section \ref{sec:fm}.
  • Figure 4: Architecture of the translation model that transforms predicted future hand-object masks to a robot trajectory, described in Section \ref{sec:translation}.
  • Figure 5: Illustration of the different steps in generating hallucinated human hand trajectories from robot trajectories. This is an alternate data source for the translation model in addition to collecting paired human-robot data.
  • ...and 6 more figures