Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

Xiang Zhu, Yichen Liu, Hezhong Li, Jianyu Chen

TL;DR

The paper addresses how to generalize robot manipulation policies to unseen tasks without collecting additional teleoperation data, using human demonstration videos as prompts. It introduces a two-stage framework. Stage 1 learns embodiment-transfer representations through cross-prediction in a video diffusion model. Stage 2 trains a diffusion policy on a unified human–robot action space, augmented with ProtDiffusion Contrastive Policy (PDCP) losses that align cross-modal representations. Key contributions include VGCP cross-embodiment video generation, a unified action space bridging human and robot demonstrations, and the PDCP objective, which combines NT-Xent, prototypical cross-entropy, and Siamese metric losses to improve multi-skill generalization. Empirical results on real-world dexterous manipulation show improved generalization across object, position, and scene variations, as well as successful transfer to unseen skills, pointing to a scalable, teleoperation-free path toward robust robot learning from human videos.

Abstract

Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected with teleoperation. When facing a new task, such methods generally require collecting a new set of teleoperation data and finetuning the policy. Furthermore, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans can efficiently learn new tasks just by watching others perform them. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model finetuning. In the first stage, we train a video generation model that captures a joint representation for both human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks show the effectiveness and generalization capabilities of our proposed method.
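
The summary names NT-Xent as one component of the contrastive objective but does not give its formula here. As a rough illustration only, the sketch below implements the standard NT-Xent (normalized temperature-scaled cross-entropy) loss over paired embeddings, under the assumption that `human_emb[i]` and `robot_emb[i]` come from the same skill and every other embedding in the batch acts as a negative. The function names, the pairing scheme, and the temperature value are hypothetical and not taken from the paper; the paper's actual loss may differ.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(human_emb, robot_emb, temperature=0.5):
    """Standard NT-Xent loss, averaged over all 2N anchors.

    Assumption (not from the paper): human_emb[i] and robot_emb[i] form a
    positive pair; all other embeddings in the batch are negatives.
    """
    if len(human_emb) != len(robot_emb) or not human_emb:
        raise ValueError("expected two non-empty batches of equal size")
    n = len(human_emb)
    z = list(human_emb) + list(robot_emb)  # 2N embeddings in one batch
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the paired view
        # Similarities to every other embedding (the anchor itself is excluded).
        logits = [_cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = _cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy batch: two skills with orthogonal embeddings.
human = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[1.0, 0.0], [0.0, 1.0]]   # matches the human skills
robot_swapped = [[0.0, 1.0], [1.0, 0.0]]   # skills paired incorrectly

loss_aligned = nt_xent_loss(human, robot_aligned)
loss_swapped = nt_xent_loss(human, robot_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the loss is lower when human and robot embeddings of the same skill point in the same direction, which is the alignment behavior a contrastive term of this kind is meant to encourage.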

Paper Structure

This paper contains 31 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: In the first stage, fine-tune a video diffusion model on diverse datasets to obtain informative representations. In the second stage, use the video generation model to extract information from human prompt videos for skill learning with both human and robot data. Finally, during inference, the model is prompted with unseen human demos and performs the demonstrated tasks.
  • Figure 2: Stage 1: We fine-tune a pre-trained video generation model using cross-prediction, enabling the model to retain physical knowledge while gaining the ability for embodiment modality transfer.
  • Figure 3: Stage 2: We use the diffusion model trained in the first stage to combine information from human prompt videos with the target embodiment, generating an informative representation. This representation, together with a shared action space between human and robot and a prototypical contrastive loss, enables the diffusion policy to learn task and skill information common to both.
  • Figure 4: Examples of Cross-Prediction Video Generation.
  • Figure 5: Evaluation of the Learned Representations.
  • ...and 1 more figures