Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Haoran Chen, Shiwei Shen, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang

TL;DR

Human2Robot addresses the generalization gap in learning robot manipulation from human demonstrations by shifting from coarse alignment to dense frame-level alignment via a conditional video generation approach. It introduces H&R, a large aligned dataset of synchronized human and robot videos collected with VR teleoperation, and a two-stage framework where a Video Prediction Model learns robot dynamics from human videos and an action decoder converts predictive representations into robot actions. A KNN-based inference variant enables task execution without live demonstrations. The experiments show strong performance on seen tasks and notable one-shot generalization to unseen positions, objects, instances, and even new task categories, highlighting the potential of video-generation conditioned policies for real-world manipulation.
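
The KNN-based inference variant mentioned above can be illustrated with a minimal retrieval sketch. The summary does not say what the retrieval key is or how the retrieved demonstration is consumed, so everything here is an assumption made for illustration: the names `cosine_similarity`, `retrieve_demos`, and `demo_bank`, the task labels, the use of cosine similarity, and the idea that each stored human demonstration is indexed by a feature vector of its scene.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demos(
    query: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[str]:
    """Return the ids of the k stored demos whose features are closest to the query."""
    ranked = sorted(
        bank,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,  # highest similarity first
    )
    return [demo_id for demo_id, _ in ranked[:k]]


# Hypothetical bank of stored human demonstrations: (demo id, scene feature vector).
demo_bank = [
    ("push_box", [1.0, 0.0, 0.0]),
    ("pick_cup", [0.0, 1.0, 0.0]),
    ("close_drawer", [0.0, 0.0, 1.0]),
]

# Hypothetical feature vector of the scene the robot currently observes.
current_scene = [0.1, 0.9, 0.2]

# The retrieved demo would stand in for a live human video as the conditioning input.
print(retrieve_demos(current_scene, demo_bank, k=1))
```

In this sketch the retrieved demonstration simply replaces the live human video as the conditioning input; the paper's actual retrieval features and distance metric may differ.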

Abstract

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot employs a Video Prediction Model to learn a rich and implicit representation of robot dynamics by generating robot videos from human input, which in turn guides a decoupled action decoder. Our real-world experiments demonstrate that this approach not only achieves high performance on seen tasks but also exhibits significant one-shot generalization to novel positions, objects, instances, and even new task categories.

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Human2Robot: A human-video-conditioned policy, capable of completing seen tasks and performing unseen tasks in one shot from a single human video.
  • Figure 2: Dataset overview. (L) The proportions of the four basic task types and long tasks. (R) Platform environment and the object instances used.
  • Figure 3: Architecture overview of Human2Robot. Our approach consists of two training stages. In the first stage, we train a Video Prediction Model (VPM) to generate robotic arm videos conditioned on human videos. In the second stage, we freeze the VPM and train an action decoder to predict robot actions based on the motion features generated by the VPM.
  • Figure 4: Task overview. We train the models on seen tasks and test them at different levels of generalization.
  • Figure 5: Visualization of VPM results. The 1-step denoised result already contains sufficient motion information for downstream tasks. In addition, the 30-step (fully denoised) result is very close to the ground-truth (GT) robot video, demonstrating the effective design of our Video Prediction Model (VPM).
  • ...and 4 more figures