Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos

Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel

TL;DR

Video2Policy presents a scalable pipeline that transforms internet RGB videos into executable simulation tasks and RL policies. It jointly reconstructs task scenes from videos and uses an LLM-driven, iterative reward design process to train policies in simulation, achieving high performance on diverse manipulation tasks. The approach demonstrates strong policy generalization from varied videos and enables sim-to-real transfer, highlighting a path toward broadly capable generalist robotic policies. By grounding tasks in real-world data and leveraging vision-language models for code generation, the work offers a practical route to scalable robotics data without manual task specification.

Abstract

Simulation offers a promising approach for cheaply scaling training data for generalist policies. To scalably generate data from diverse and realistic tasks, existing algorithms rely either on large language models (LLMs), which may hallucinate tasks that are not interesting for robotics, or on digital twins, which require careful real-to-sim alignment and are hard to scale. To address these challenges, we introduce Video2Policy, a novel framework that leverages internet RGB videos to reconstruct tasks based on everyday human behavior. Our approach comprises two phases: (1) task generation in simulation from videos; and (2) reinforcement learning with LLM-generated reward functions that are refined iteratively in context. We demonstrate the efficacy of Video2Policy by reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, which depicts diverse and complex human behaviors across 9 different tasks. Our method can successfully train RL policies on such tasks, including complex and challenging ones such as throwing. Finally, we show that the generated simulation data can be scaled up to train a general policy, and that this policy can be transferred back to a real robot in a Real2Sim2Real manner.
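
The second phase (iteratively refined reward functions) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the LLM is replaced by a hypothetical stub (`propose_goal_guesses`), RL training by a greedy 1-D search (`train_policy`), and the task by reaching a fixed position `TARGET`. All names, the reward form, and the success metric are invented for illustration.

```python
from typing import Callable, List, Tuple

TARGET = 1.0  # toy ground-truth goal; only the success metric sees it

RewardFn = Callable[[float], float]


def make_reward(goal_guess: float) -> RewardFn:
    """Build a dense reward that peaks at goal_guess (hypothetical form)."""
    return lambda x: -abs(x - goal_guess)


def propose_goal_guesses(best_guess: float, spread: float) -> List[float]:
    """Stand-in for the LLM: propose variations around the best guess so far."""
    return [best_guess - spread, best_guess, best_guess + spread]


def train_policy(reward_fn: RewardFn, steps: int = 200,
                 step_size: float = 0.01) -> float:
    """Stand-in for RL training: greedy local search on a 1-D position."""
    x = 0.0
    for _ in range(steps):
        # Move left, stay, or move right, whichever the reward prefers.
        x = max((x - step_size, x, x + step_size), key=reward_fn)
    return x


def success_score(final_x: float) -> float:
    """Task metric in [0, 1], computed independently of the candidate reward."""
    return max(0.0, 1.0 - abs(final_x - TARGET))


def design_reward_iteratively(iterations: int = 3,
                              spread: float = 0.5) -> Tuple[float, float]:
    """Propose rewards, train on each, keep the best, and feed it back."""
    best_guess = 0.0
    best_score = success_score(train_policy(make_reward(best_guess)))
    for _ in range(iterations):
        # The best candidate so far plays the role of in-context feedback.
        for guess in propose_goal_guesses(best_guess, spread):
            score = success_score(train_policy(make_reward(guess)))
            if score > best_score:
                best_guess, best_score = guess, score
    return best_guess, best_score


if __name__ == "__main__":
    print(design_reward_iteratively())
```

The point of the sketch is the separation of roles: candidate rewards drive training, while a separate task metric decides which candidate is carried into the next round.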

Paper Structure

This paper contains 16 sections, 11 figures, 9 tables.

Figures (11)

  • Figure 1: The Video2Policy framework can leverage internet videos to generate simulation tasks and learn policies for them automatically, which can be considered a data engine for generalist policies.
  • Figure 2: Visualizations of tasks generated from the SSv2 video dataset.
  • Figure 3: Examples of the segmentation mask observation in simulation and in the real world; this observation helps bridge the sim2real gap.
  • Figure 4: V2P achieves better performance across iterations.
  • Figure 5: Performance of the trained general policy on 10 unseen task instances. BC-V2P significantly outperforms BC-CoP on 9 of the 10.
  • ...and 6 more figures