Contrast, Imitate, Adapt: Learning Robotic Skills From Raw Human Videos

Zhifeng Qian, Mingyu You, Hongjun Zhou, Xuanhui Xu, Hao Fu, Jinzhe Xue, Bin He

TL;DR

This work tackles learning robotic manipulation from raw human videos without action labels. It introduces Contrast-Imitate-Adapt (CIA), a three-stage pipeline: an interaction-aware alignment transformer (IAAformer) learns task priors, TrajGAN learns action priors by imitating trajectories, and an Inversion-Interaction framework with a trajectory-semantics-based optimization method (TS-CEM) adapts the trajectories to novel layouts through limited interaction. Across six real-world manipulation tasks, CIA achieves higher task success and better generalization than state-of-the-art baselines and prior temporal-alignment methods. The approach offers a scalable path toward robust robot learning from abundant human videos by combining video-level priors, learned action priors, and safe adaptation strategies.

Abstract

Learning robotic skills from raw human videos remains a non-trivial challenge. Previous works tackled this problem by leveraging behavior cloning or learning reward functions from videos. Despite their remarkable performance, they may introduce several issues, such as the necessity for robot actions, requirements for consistent viewpoints and similar layouts between human and robot videos, as well as low sample efficiency. To this end, our key insight is to learn task priors by contrasting videos, to learn action priors by imitating trajectories from videos, and to utilize the task priors to guide trajectories to adapt to novel scenarios. We propose a three-stage skill learning framework denoted as Contrast-Imitate-Adapt (CIA). An interaction-aware alignment transformer (IAAformer) is proposed to learn task priors by temporally aligning video pairs. Then a trajectory generation model is used to learn action priors. To adapt to novel scenarios different from human videos, the Inversion-Interaction method is designed to initialize coarse trajectories and refine them through limited interaction. In addition, CIA introduces an optimization method based on semantic directions of trajectories for interaction safety and sample efficiency. The alignment distances computed by IAAformer are used as the rewards. We evaluate CIA in six real-world everyday tasks, and empirically demonstrate that CIA significantly outperforms previous state-of-the-art works in terms of task success rate and generalization to diverse novel scenario layouts and object instances.
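
The adaptation stage described above (start from a coarse trajectory, refine it with a small number of interactions, and score each attempt by an alignment distance) can be illustrated with a generic sampling-based optimizer. The sketch below is a plain cross-entropy-method loop, on the assumption that the "CEM" in TS-CEM refers to the cross-entropy method, which this summary does not spell out. It is not the authors' TS-CEM: it has no notion of semantic trajectory directions, and `alignment_distance`, `cem_refine`, the latent vectors, and all hyperparameters are hypothetical stand-ins chosen only for illustration.

```python
import random
import statistics


def alignment_distance(latent, target):
    """Hypothetical stand-in for the alignment distance a video model would output.

    Here it is just the Euclidean distance to a fixed target vector.
    """
    return sum((z - t) ** 2 for z, t in zip(latent, target)) ** 0.5


def cem_refine(init_latent, reward_fn, iters=20, pop=32, n_elite=8,
               init_std=1.0, seed=0):
    """Generic cross-entropy-method loop (not the paper's TS-CEM).

    init_latent plays the role of a coarse initialization; reward_fn is
    called exactly once per sampled candidate.
    """
    rng = random.Random(seed)
    mean = list(init_latent)
    std = [init_std] * len(mean)
    best, best_reward = list(mean), reward_fn(mean)

    for _ in range(iters):
        # Sample candidate latents around the current mean.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop)]
        # Score every candidate once and rank by reward (higher is better).
        scored = [(reward_fn(c), c) for c in candidates]
        scored.sort(key=lambda rc: rc[0], reverse=True)
        elites = [c for _, c in scored[:n_elite]]

        # Refit the sampling distribution to the elite candidates.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]

        # Keep the best candidate seen so far.
        if scored[0][0] > best_reward:
            best_reward, best = scored[0][0], list(scored[0][1])

    return best, best_reward


# Toy usage: the reward is the negative distance, so smaller distance is better.
target = [0.5, -1.0, 2.0]
init = [0.0, 0.0, 0.0]
reward = lambda z: -alignment_distance(z, target)
best, best_reward = cem_refine(init, reward)
```

In this toy setting each call to `reward_fn` is cheap. In a setting like the one the abstract describes, each evaluation would correspond to a real interaction, which is why the population size and iteration count would need to stay small.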

Paper Structure (26 sections, 20 equations, 14 figures, 5 tables, 1 algorithm)

Figures (14)

  • Figure 1: Existing issues of robot learning from human videos.
  • Figure 2: Illustration of our framework CIA. Priors are first extracted by pre-trained models. Then, IAAformer aims to temporally align video pairs while TrajGAN learns to generate trajectories. To adapt to novel scenarios, TrajGAN is initialized by GAN inversion and improved by the proposed TS-CEM where the rewards are output by IAAformer.
  • Figure 3: Illustration of extraction and transformation of human priors and examples of action priors.
  • Figure 4: Architecture of our Interaction-Aware Alignment transformer (IAAformer).
  • Figure 5: The schematic diagram of self-supervised learning with the hindsight augmentation.
  • ...and 9 more figures