Simulating the Visual World with Artificial Intelligence: A Roadmap

Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu

TL;DR

This paper defines a roadmap from video generation to world models by framing video foundation models as a combination of an implicit world model and a video renderer. It introduces a four-generation taxonomy (Faithfulness, Interactiveness, Planning, Stochasticity) and formalizes navigation modes to unify interaction signals. The survey reviews methods across general scenes, robotics, autonomous driving, and gaming, highlighting geometry, 3D priors, physics priors, and multi-modal conditioning as enabling ingredients. It discusses open challenges, future directions, and the societal implications of increasingly capable world models and their potential to transform simulation-based reasoning and decision making.

Abstract

The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains, such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.

Paper Structure

This paper contains 26 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of 4 Generations and 3 Core Capabilities from Video Generation to World Model. The figure illustrates the key capabilities emphasized in the first through third generations of world models, as well as our insight for future world models. We outline a long-term vision of world models that can simulate a broad range of environments across multiple spatial and temporal scales. The figure highlights four foundational characteristics: real-time responsiveness, stochasticity, multi-scale planning, and intrinsic physical faithfulness. These collectively support the long-term goal of zero-shot generalization.
  • Figure 2: Overview of Annual Paper and Article Counts in the Video Generation & World Model Fields. The counts were derived from searches on Google Scholar and arXiv using the fixed keyword combination “video generation” and “world model”.
  • Figure 3: The Characteristics of Physical World Model and Mental World Model. This figure highlights the distinct inputs, internal processes, and outputs, as well as the interaction through perception, planning, and guidance between the physical world model and the mental world model.
  • Figure 4: Overview of the World Model Defined in this Paper. The world model must take inputs such as text, images, videos, audio, or their combinations. It may also incorporate external conditions for interaction, including spatial conditions and navigation conditions. A video generation model is leveraged to process the intermediate state representations to produce video outputs, while other task-specific outputs may also be generated depending on the downstream application.
  • Figure 5: Overview of the Capabilities of World Model Across 4 Generations. This figure presents the three main capabilities of world models, along with their corresponding secondary capabilities under each category.
  • ...and 3 more figures