AVID: Adapting Video Diffusion Models to World Models

Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma

TL;DR

This work tackles the scarcity of action-labelled data for sequential decision-making by leveraging unlabelled videos to build world models. It introduces AVID, an adapter-based method that conditions on actions by modifying the intermediate outputs of a pretrained video diffusion model, without requiring access to the pretrained weights. Through experiments on Procgen CoinRun and RT1 robotic data, AVID demonstrates competitive or superior performance to baselines that do not have weight access, particularly at smaller model sizes, and reveals that the learned mask effectively allocates motion planning between the pretrained prior and task-specific refinements. The study highlights the potential of repurposing large, pretrained video models for embodied AI tasks when access to internal parameters is limited, and it calls for API provisions that expose intermediate diffusion outputs to enable such flexible adaptations.

Abstract

Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce, and therefore scaling up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source, which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that, if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.
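
As a rough illustration of what "using a learned mask to modify the intermediate outputs of the pretrained model" could look like at a single denoising step, here is a minimal NumPy sketch. The combination rule (pretrained noise prediction scaled elementwise by a mask in [0, 1], plus an adapter correction), the function name masked_noise_prediction, and the tensor shapes are assumptions made for illustration; this section does not specify AVID's exact formulation. The adapter outputs are replaced by random arrays so the sketch runs without any model.

```python
import numpy as np


def sigmoid(x):
    """Squash logits into (0, 1) so they can act as a soft mask."""
    return 1.0 / (1.0 + np.exp(-x))


def masked_noise_prediction(eps_pretrained, eps_adapter, mask_logits):
    """Combine two noise predictions with a mask (assumed form, not confirmed by the source).

    eps_pretrained: noise predicted by the frozen pretrained model (used as a black box).
    eps_adapter:    noise correction predicted by the action-conditioned adapter.
    mask_logits:    adapter output that decides how much of the pretrained
                    prediction to keep at each latent location.
    """
    mask = sigmoid(mask_logits)
    # Mask near 1: keep the pretrained prediction and add the adapter correction.
    # Mask near 0: the adapter output alone determines the result.
    eps_final = mask * eps_pretrained + eps_adapter
    return eps_final, mask


# Hypothetical latent video shape: (frames, height, width, channels).
rng = np.random.default_rng(0)
shape = (4, 8, 8, 3)
eps_pre = rng.standard_normal(shape)
eps_adapt = rng.standard_normal(shape)
# One mask value per latent location, broadcast over channels.
logits = rng.standard_normal(shape[:-1] + (1,))

eps_final, mask = masked_noise_prediction(eps_pre, eps_adapt, logits)
print(eps_final.shape, float(mask.min()), float(mask.max()))
```

The point of this design is that only the outputs of the pretrained model are needed at each step, never its weights or gradients, which matches the setting described in the abstract.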

Paper Structure

This paper contains 36 sections, 15 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Overview of AVID world model adapter architecture.
  • Figure 2: Top three rows: Examples of videos generated for RT1 (extended in Figure \ref{fig:qual_extended}, Appendix \ref{app:qualitative}). Bottom row: Mask generated in downsampled latent space by AVID. White indicates the mask is set to 1 and black indicates the mask is set to 0.
  • Figure 3: Top three rows: Examples of videos generated for Coinrun500k (extended in Figure \ref{fig:procgen_qualitative_extended}, Appendix \ref{app:qualitative}). Bottom row: Mask generated by AVID, where white indicates the mask is set to 1 and black indicates the mask is set to 0.
  • Figure 4: (a) RT1 averaged normalized performance versus parameter count. (b) Coinrun500k averaged normalized performance versus parameter count. (c) Coinrun averaged normalized performance versus dataset size. Details on metric normalization are in Appendix \ref{app:normalization}. (d) Average mask ($m$) values of AVID throughout the diffusion process.
  • Figure 5: Examples of the mask, $m$, produced by AVID averaged throughout the diffusion process for Coinrun500k. White indicates the mask is set to 1, and black indicates the mask is set to 0.
  • ...and 5 more figures