Vid2World: Crafting Video Diffusion Models to Interactive World Models
Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
TL;DR
Vid2World tackles the data-hungry nature of world models by repurposing pre-trained internet-scale video diffusion models as interactive, autoregressive world models. It introduces video diffusion causalization to enforce temporal causality and causal action guidance to incorporate frame-level actions, enabling accurate, action-conditioned rollouts. Across robot manipulation, 3D game simulation, and open-world navigation, Vid2World achieves state-of-the-art transfer performance and supports downstream decision-making, including Real2Sim policy evaluation. The work demonstrates a scalable pathway for leveraging large video priors to build high-fidelity, interactive world models with limited interaction data.
Abstract
World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for transferring pre-trained video diffusion models into interactive world models. To bridge this gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
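As a rough illustration of the two ideas named above, the following sketch shows (a) a causal mask over the frame axis, so that each frame attends only to itself and earlier frames, and (b) a guidance step that combines action-conditioned and action-free noise predictions. This is not the paper's implementation: the abstract does not give formulas, so the classifier-free-guidance form of `guided_prediction`, the single-head attention, and all function and parameter names here are assumptions chosen for illustration.

```python
import numpy as np


def causal_temporal_mask(num_frames: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j (j <= i)."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def masked_temporal_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over the frame axis (illustrative only).

    q, k, v have shape (num_frames, dim); mask has shape (num_frames, num_frames).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed (future) positions get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def guided_prediction(eps_cond, eps_uncond, scale):
    """Assumed classifier-free-guidance-style combination of two noise predictions.

    eps_cond:   prediction given the action (hypothetical model output)
    eps_uncond: prediction with the action dropped (hypothetical model output)
    scale:      guidance strength; 1.0 returns eps_cond, 0.0 returns eps_uncond
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
    out = masked_temporal_attention(q, k, v, causal_temporal_mask(4))
    print(out.shape)
```

With the causal mask in place, the output for frame t does not depend on any later frame, which is the property an autoregressive rollout needs: frames can be generated one at a time, each conditioned only on what came before and on the action supplied at that step.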
