Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Sicheng Xie; Haidong Cao; Zejia Weng; Zhen Xing; Haoran Chen; Shiwei Shen; Jiaqi Leng; Zuxuan Wu; Yu-Gang Jiang

TL;DR

Human2Robot addresses the generalization gap in learning robot manipulation from human demonstrations by shifting from coarse alignment to dense frame-level alignment via a conditional video generation approach. It introduces H&R, a large aligned dataset of synchronized human and robot videos collected with VR teleoperation, and a two-stage framework where a Video Prediction Model learns robot dynamics from human videos and an action decoder converts predictive representations into robot actions. A KNN-based inference variant enables task execution without live demonstrations. The experiments show strong performance on seen tasks and notable one-shot generalization to unseen positions, objects, instances, and even new task categories, highlighting the potential of video-generation conditioned policies for real-world manipulation.
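The KNN-based inference variant can be pictured as a nearest-neighbor lookup that selects a stored demonstration in place of a live one. The sketch below is only illustrative: the summary does not say what the lookup operates on, so the embedding space, the Euclidean metric, and the names `knn_retrieve` and `demo_bank` are all assumptions rather than the authors' implementation.

```python
import numpy as np


def knn_retrieve(query, bank, k=1):
    """Return (indices, distances) of the k rows in `bank` closest to `query`.

    Assumption: each row of `bank` is an embedding of one stored human
    demonstration clip, and `query` is an embedding in the same space.
    """
    query = np.asarray(query, dtype=np.float64)
    bank = np.asarray(bank, dtype=np.float64)
    # Euclidean distance from the query to every stored embedding.
    dists = np.linalg.norm(bank - query, axis=1)
    # Stable sort keeps the original order among equally distant entries.
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


# Hypothetical bank of three 2-D embeddings (real embeddings would be larger).
demo_bank = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
idx, dist = knn_retrieve([0.9, 1.2], demo_bank, k=2)
print(idx, dist)
```

Under these assumptions, the retrieved clip would stand in for the live human video as the conditioning input to the Video Prediction Model.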

Abstract

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot employs a Video Prediction Model to learn a rich and implicit representation of robot dynamics by generating robot videos from human input, which in turn guides a decoupled action decoder. Our real-world experiments demonstrate that this approach not only achieves high performance on seen tasks but also exhibits significant one-shot generalization to novel positions, objects, instances, and even new task categories.
