Video Generators are Robot Policies

Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, Carl Vondrick

TL;DR

This work reframes robot policy learning as video generation by introducing Video Policy, a diffusion-based framework that jointly generates future video frames and robot actions from an initial scene and task description. By training a video generator and a lightweight action head in a two-stage process and preventing action-loss gradients from updating the video model, the method leverages action-free video data to learn robust, sample-efficient policies. Empirically, Video Policy achieves state-of-the-art performance on RoboCasa and Libero10 benchmarks with far fewer demonstrations than prior methods and demonstrates solid real-world generalization across object locations, unseen objects, and backgrounds. The findings suggest that powerful video priors from scalable video models can dramatically improve data efficiency and generalization in manipulation tasks, albeit with notable computational costs and the need for broader validation across more environments and architectures.
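The gradient-blocking idea mentioned above can be illustrated with a deliberately tiny sketch. It is not the authors' diffusion model: the scalar weights `w_video` and `w_action`, the function `train_step`, and the squared-error losses are all hypothetical stand-ins chosen for illustration. The sketch only shows the data flow in which the action head reads the video model's representation while the action loss is prevented from updating the video model.

```python
def train_step(w_video, w_action, x, frame_target, action_target,
               lr=0.1, stop_gradient=True):
    """One SGD step on a scalar toy model (illustrative assumption only).

    h = w_video * x    stands in for the video model's frame representation
    a = w_action * h   stands in for the action head conditioned on h
    """
    h = w_video * x
    a = w_action * h

    video_err = h - frame_target
    action_err = a - action_target

    # Gradient of the video loss (h - frame_target)^2 with respect to w_video.
    grad_video = 2.0 * video_err * x
    # Gradient of the action loss (a - action_target)^2 with respect to w_action.
    grad_action = 2.0 * action_err * h

    if not stop_gradient:
        # Without the stop-gradient, the action loss also reaches w_video
        # through h (chain rule: d a / d w_video = w_action * x).
        grad_video += 2.0 * action_err * w_action * x

    return w_video - lr * grad_video, w_action - lr * grad_action


# Two steps that differ only in the action target.
wv_a, wa_a = train_step(0.5, 0.3, 2.0, 1.5, action_target=0.7)
wv_b, wa_b = train_step(0.5, 0.3, 2.0, 1.5, action_target=-4.0)
print("video weight unaffected by action target:", wv_a == wv_b)
print("action weight depends on action target:", wa_a != wa_b)
```

With `stop_gradient=True`, the update to `w_video` depends only on the frame target, so in this toy the video weight could equally be trained on samples that have no action label at all. In an autodiff framework the same effect is typically obtained by detaching the representation before passing it to the action head.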

Abstract

Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation and can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and the real world. We further highlight that task success is closely tied to the generated video, with action-free video data providing critical benefits for generalizing to novel tasks. By leveraging large-scale video generative models, we achieve superior performance compared to traditional behavior cloning, paving the way for more scalable and data-efficient robot policy learning.

Paper Structure

This paper contains 20 sections, 4 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Video Generation as a Proxy for Robot Policy Learning. Given an initial observation and a natural language task prompt, our model generates a video of a robot executing the task (top) jointly with generating robot actions via a separate diffusion network (middle). This modular design enables learning from action-free video data and improves generalization to unseen scenarios, offering a scalable and sample-efficient alternative to traditional behavior cloning.
  • Figure 2: Video Policy takes an image of the initial environment state together with the noise vectors corresponding to the future frames and actions as input. It then jointly diffuses the frames and actions, using the representation of the frames as conditioning for the action denoiser. This modular design allows the two networks to be trained separately, opening the way for action-free learning of the task dynamics via video generation.
  • Figure 3: Success rate of Video Policy as a function of the video prediction horizon. Learning the dynamics of the environment is critical for achieving generalization in policy learning, as evidenced by the larger effect of the prediction horizon on the task with distribution shift.
  • Figure 4: Generalization to tasks with no policy supervision by capitalizing on action-free video data. Both our behavior cloning head and the baseline DP are trained on 12 tasks on the left, but our video generation model also has access to action-free videos for all 24 tasks. The upper bounds on the right correspond to models trained with full action supervision for comparison.
  • Figure 5: Qualitative results for Pick and Place generalization experiments in the real world. Video Policy demonstrates strong robustness to object locations, appearance and background colour.
  • ...and 10 more figures