
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li

TL;DR

This work introduces VPDD, a framework that pre-trains a unified discrete diffusion model on large-scale actionless human videos and fine-tunes it on limited robot demonstrations to learn actionable policies. It encodes both human and robot videos into a shared discrete latent space via a VQ-VAE and uses a mask-and-replace diffusion objective to predict future tokens conditioned on history and language. A dual-network setup (a Perceiver Transformer for video tokens and GPT-2 for actions) learns video dynamics and forecasts actions, and the predicted videos guide few-shot policy learning. Experiments on Meta-World and RLBench show that VPDD achieves superior transfer, strong generalization to unseen scenes, and improved sample efficiency compared to state-of-the-art baselines. The approach demonstrates the practical impact of large-scale actionless video pre-training for embodied AI with limited robot data.

Abstract

Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast number of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. However, this remains challenging because of the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal data structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion model to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and yields fine-tuned policies that outperform previous state-of-the-art approaches. Our project website is available at https://video-diff.github.io/.
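
The abstract names a mask-and-replace diffusion strategy over discrete video tokens but does not spell out the corruption process. The following is a minimal sketch of one common form of such a forward process, not the authors' implementation: each token is kept, swapped for a uniformly drawn codebook id, or turned into a special mask token. The names `MASK`, `mask_and_replace_step`, and `corrupt`, as well as the fixed per-step probabilities, are assumptions made for illustration; practical models typically vary these probabilities with the diffusion step.

```python
import random

MASK = -1  # hypothetical id for the special [MASK] token (assumption)


def mask_and_replace_step(tokens, vocab_size, keep_prob, mask_prob, rng):
    """Apply one forward corruption step to a list of discrete token ids.

    Each unmasked token is kept with probability keep_prob, turned into
    MASK with probability mask_prob, and otherwise replaced by an id drawn
    uniformly from the codebook. Tokens that are already MASK stay MASK.
    """
    if keep_prob < 0.0 or mask_prob < 0.0 or keep_prob + mask_prob > 1.0:
        raise ValueError("keep_prob and mask_prob must be >= 0 and sum to <= 1")
    corrupted = []
    for tok in tokens:
        if tok == MASK:
            corrupted.append(MASK)  # the mask state is absorbing
            continue
        u = rng.random()  # uniform in [0, 1)
        if u < keep_prob:
            corrupted.append(tok)
        elif u < keep_prob + mask_prob:
            corrupted.append(MASK)
        else:
            # Uniform replacement; the draw may return the original id.
            corrupted.append(rng.randrange(vocab_size))
    return corrupted


def corrupt(tokens, vocab_size, num_steps, keep_prob, mask_prob, seed=0):
    """Run num_steps corruption steps with fixed illustrative probabilities."""
    rng = random.Random(seed)
    x = list(tokens)
    for _ in range(num_steps):
        x = mask_and_replace_step(x, vocab_size, keep_prob, mask_prob, rng)
    return x


if __name__ == "__main__":
    video_tokens = [3, 1, 4, 1, 5, 2, 6, 0]  # toy stand-in for VQ-VAE codes
    print(corrupt(video_tokens, vocab_size=8, num_steps=4,
                  keep_prob=0.7, mask_prob=0.2))
```

Under this sketch, a denoising network would be trained to recover the original tokens from the corrupted sequence, conditioned on history and language as described above. Because masked positions never revert in the forward direction, running enough steps drives the sequence toward all `MASK`, which gives the reverse process a well-defined starting point.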

Paper Structure (29 sections, 15 equations, 12 figures, 4 tables, 2 algorithms)


Figures (12)

  • Figure 1: Overall framework of VPDD.
  • Figure 2: Overall pipeline of VPDD. A video-based VQ-VAE encodes both human and robot videos into discrete latent codes. A unified discrete diffusion model is then pre-trained on these video latent codes with a self-supervised objective, predicting future videos conditioned on language instructions and historical videos. The pre-trained video prediction model $p_{\theta_1}$ can capture temporal dynamics and task-specific representations. Lastly, we fine-tune our diffusion model on a limited amount of robot data. In each diffusion step of the fine-tuning stage, we leverage $p_{\theta_1}$ to provide hidden representations $z_{\tilde{\bm{x}}_{0}^{\rm v}}$ that benefit downstream action learning with video foresight. This integration of video prediction and action learning is achieved through our unified discrete diffusion.
  • Figure 3: Single-view and multi-view images from Meta-World button-press and RLBench drag-stick tasks, sampled from videos predicted by $p_{\theta_1}$.
  • Figure 4: Average success rate across 3 seeds on MT50-rand. Each task is evaluated for 50 episodes.
  • Figure 5: Average success rate across 3 seeds on shifted button-press-v2 and handle-press-v2 tasks.
  • ...and 7 more figures