Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long

TL;DR

Vid2World tackles the data-hungry nature of world models by repurposing pre-trained internet-scale video diffusion models as interactive, autoregressive world models. It introduces video diffusion causalization to enforce temporal causality and causal action guidance to incorporate frame-level actions, enabling accurate, action-conditioned rollouts. Across robot manipulation, 3D game simulation, and open-world navigation, Vid2World achieves state-of-the-art transfer performance and supports downstream decision-making, including Real2Sim policy evaluation. The work demonstrates a scalable pathway for leveraging large video priors to build high-fidelity, interactive world models with limited interaction data.

Abstract

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.

Paper Structure

This paper contains 87 sections, 1 theorem, 44 equations, 16 figures, 3 tables, 3 algorithms.

Key Result

Proposition 1

Assuming the input sequence $\mathbf{z}_t \triangleq f(t)$ is generated by a twice-differentiable, $L$-smooth function $f(t)$, the approximation error of the Extrapolative Weight Transfer (EWT) can be bounded by:

Figures (16)

  • Figure 1: Vid2World repurposes video diffusion models for interactive world modeling. From the perspective of the data pyramid for world models, it leverages vast pre-trained knowledge from internet-scale, action-free video data to achieve high-fidelity, action-conditioned generation across diverse downstream domains with limited interaction data.
  • Figure 2: Transforming video diffusion models into interactive world models involves two key challenges: (1) Causal generation: converting full-sequence diffusion models into causal diffusion models; (2) Action conditioning: adapting causal diffusion models into interactive world models.
  • Figure 3: Illustration of weight transfer mechanisms for temporal convolution layers: (1) Shift: shifts all weights into the past. (2) Masked: retains only past weights. (3) Extrapolative: leverages local linear feature relationships in a more principled way (example shown with $m=1, p=2$).
  • Figure 4: Training and sampling of Vid2World, initialized by architecture causalization. (a) During training, we add noise at an independently sampled level to each frame and randomly drop out each action with a fixed probability. (b) For autoregressive rollout, we denoise the latest frame while keeping the history clean. Action guidance is added for the current action. See Appendix \ref{app:model_detail} for details.
  • Figure 5: Vid2World for Real2Sim policy evaluation, validated by real-world evaluation.
  • ...and 11 more figures
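
The training and rollout setup described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy linear noise schedule, the 1000 noise levels, the 0.1 dropout probability, and the use of an all-zero vector for a dropped action are all assumptions made for illustration. Only the two ideas from the caption are taken from the source: each frame gets its own independently sampled noise level and each action is dropped independently during training, while at rollout time the history is kept clean and only the latest frame is fully noised.

```python
import numpy as np

def make_training_inputs(frames, actions, num_levels=1000,
                         action_drop_prob=0.1, rng=None):
    """Build one training example (hypothetical helper, not from the paper).

    frames:  array of shape (T, ...), one entry per video frame
    actions: array of shape (T, A), one action per frame
    """
    rng = np.random.default_rng() if rng is None else rng
    T = frames.shape[0]

    # One independently sampled noise level per frame.
    levels = rng.integers(0, num_levels, size=T)

    # Toy linear schedule (an assumption): level 0 keeps the frame clean.
    alpha = 1.0 - levels / num_levels
    alpha = alpha.reshape((T,) + (1,) * (frames.ndim - 1))
    noise = rng.standard_normal(frames.shape)
    noisy = np.sqrt(alpha) * frames + np.sqrt(1.0 - alpha) * noise

    # Drop each action independently; a dropped action becomes a zero
    # vector here, standing in for whatever null token a real model uses.
    keep = rng.random(T) >= action_drop_prob
    masked_actions = actions * keep[:, None]
    return noisy, levels, masked_actions, keep

def rollout_noise_levels(history_len, max_level=999):
    """Noise levels for one autoregressive step: clean history, noisy last frame."""
    levels = np.zeros(history_len + 1, dtype=int)
    levels[-1] = max_level
    return levels
```

For example, `rollout_noise_levels(3)` yields the levels `[0, 0, 0, 999]`: three clean history frames followed by the frame currently being denoised.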

Theorems & Definitions (2)

  • Proposition 1
  • proof
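
The three weight transfer mechanisms named in the Figure 3 caption, including the Extrapolative Weight Transfer that Proposition 1 refers to, can be sketched for a single temporal convolution with taps at offsets (-1, 0, +1). This is an illustration under stated assumptions, not the paper's formula: it reads $m=1, p=2$ as one future tap approximated from two past frames by linear extrapolation, $\mathbf{z}_{t+1} \approx 2\mathbf{z}_t - \mathbf{z}_{t-1}$, and folds the future weight into the past weights accordingly. All function names are hypothetical, and weights are scalars for simplicity.

```python
import numpy as np

# A causal kernel is stored as weights for offsets (-2, -1, 0).
CAUSAL_OFFSETS = (-2, -1, 0)
FULL_OFFSETS = (-1, 0, 1)

def shift_transfer(w):
    """Shift: reuse all three weights, moved one step into the past."""
    w_prev, w_cur, w_next = w
    return np.array([w_prev, w_cur, w_next])

def masked_transfer(w):
    """Masked: keep only the past and current weights, drop the future one."""
    w_prev, w_cur, _ = w
    return np.array([0.0, w_prev, w_cur])

def extrapolative_transfer(w):
    """Extrapolative (assumed form): substitute z[t+1] ~ 2*z[t] - z[t-1]."""
    w_prev, w_cur, w_next = w
    return np.array([0.0, w_prev - w_next, w_cur + 2.0 * w_next])

def conv_at(z, weights, offsets, t):
    """Evaluate the convolution output at a single time index t."""
    return sum(wk * z[t + o] for wk, o in zip(weights, offsets))

# On an exactly linear sequence, the extrapolated kernel reproduces the
# original non-causal output, while the masked kernel does not.
z = 3.0 * np.arange(10) + 1.0
w = np.array([0.2, 0.5, 0.3])
full = conv_at(z, w, FULL_OFFSETS, 5)
extrapolated = conv_at(z, extrapolative_transfer(w), CAUSAL_OFFSETS, 5)
masked = conv_at(z, masked_transfer(w), CAUSAL_OFFSETS, 5)
```

Under this reading, the extrapolated kernel is exact when the features vary linearly in time, which matches the intuition that any approximation error should depend on how far $f(t)$ departs from linearity.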