Simulating the Real World: A Unified Survey of Multimodal Generative Models

Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong

TL;DR

To the authors' knowledge, this survey is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework, and it serves as a bridge for advancing research on multimodal generative models and real-world simulation.

Abstract

Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation, which integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, and offer insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation.
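
The dimensional progression described in the abstract can be read as a small taxonomy over three data properties. The sketch below is only an illustration of that reading, not code from the survey: the names `Property`, `TAXONOMY`, and `extends` are hypothetical, and the mapping simply restates the abstract's pairing of each modality with appearance, geometry, and dynamics.

```python
from enum import Flag, auto


class Property(Flag):
    """Data properties named in the abstract (hypothetical encoding)."""
    APPEARANCE = auto()
    GEOMETRY = auto()
    DYNAMICS = auto()


# Modality -> properties it covers, as stated in the abstract.
TAXONOMY = {
    "2D": Property.APPEARANCE,
    "video": Property.APPEARANCE | Property.DYNAMICS,
    "3D": Property.APPEARANCE | Property.GEOMETRY,
    "4D": Property.APPEARANCE | Property.GEOMETRY | Property.DYNAMICS,
}


def extends(lower: str, higher: str) -> bool:
    """True if `higher` covers every property of `lower` plus at least one more."""
    a, b = TAXONOMY[lower], TAXONOMY[higher]
    return (a & b) == a and a != b


if __name__ == "__main__":
    # List every pair where one modality strictly extends another.
    for lo in TAXONOMY:
        for hi in TAXONOMY:
            if extends(lo, hi):
                print(f"{lo} -> {hi}")
```

Under this encoding, video and 3D each extend 2D along a different axis, neither extends the other, and 4D extends both.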

Paper Structure

This paper contains 32 sections, 8 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Roadmap of dimensional growth from 2D images to video, 3D, and 4D content in real-world simulation, outlining a conceptual taxonomy based on the coverage of data properties (i.e., appearance, geometry, and dynamics).
  • Figure 2: The Dimensional Evolution of Generative AI. We present a unified framework connecting 2D, Video, 3D, and 4D generation through text-guided synthesis. This paradigm illustrates how higher-dimensional content is synthesized by extending foundational modalities along spatial and temporal axes. (1) 2D$\to$3D [wang2023score, poole2024dreamfusion, lin2023magic3d, wang2024prolificdreamer]: Spatial lifting of 2D priors to achieve geometric consistency; (2) 2D$\to$Video [bar2024lumiere, blattmann2023stable, singer2022make]: Temporal inflation of static features to capture motion dynamics; (3) Video$\to$4D [jiang2023consistent4d, wu2025cat4d, wu2025sc4d, zhang20244diffusion]: Spatial reconstruction and stabilization of dynamic sequences; (4) 3D$\to$4D [singer2023text, bah20244dfy, ren2023dreamgaussian4d, yu20244real]: Temporal animation and deformation of static geometry. This perspective underscores that higher-dimensional generation methodologies are derivatives of foundational lower-dimensional generative priors, adapted through specialized architectural extensions.
  • Figure 3: An illustration of the video generation paradigm. Video generation models are either built on top of image generation models by adding temporal layers or trained from scratch.
  • Figure 4: Qualitative comparison between different video generation methods. Results are obtained from Movie Gen [polyak2024movie].
  • Figure 5: Three main categories of neural scene representations. (a) Explicit representation stores geometry directly using point clouds, voxel grids [gupta20203d], meshes [rossi2021robust], and 3D Gaussians [kerbl20233d]. (b) Implicit representation defines objects through functions like Signed Distance Functions (SDF) [park2019deepsdf] and Neural Radiance Fields (NeRF) [mildenhall2020nerf], enabling smooth, continuous surfaces without fixed resolution. (c) Hybrid representation combines explicit and implicit methods, using techniques like Hybrid Voxel Grids, Deep Marching Tetrahedra (DMTet) [shen2021dmtet], and Triplanes for better efficiency and flexibility.
  • ...and 10 more figures
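
The contrast between explicit and implicit representations in the Figure 5 caption can be made concrete with a toy example. The sketch below is not taken from the survey: it uses an analytic signed distance function for a sphere as a stand-in for a learned SDF, and the helper names `sphere_sdf` and `voxelize` are hypothetical. The implicit form can be queried at any point, whereas the explicit occupancy grid fixes its resolution when it is built.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


def voxelize(sdf, resolution, extent=1.5):
    """Explicit occupancy grid: evaluate `sdf` at voxel centers in [-extent, extent]^3."""
    step = 2 * extent / resolution
    coords = [-extent + (i + 0.5) * step for i in range(resolution)]
    return [[[sdf((x, y, z)) <= 0.0 for z in coords] for y in coords] for x in coords]


# Implicit: query the surface function at arbitrary continuous coordinates.
inside = sphere_sdf((0.0, 0.0, 0.0))
outside = sphere_sdf((2.0, 0.0, 0.0))

# Explicit: the same shape stored as a fixed 8x8x8 grid of booleans.
grid = voxelize(sphere_sdf, resolution=8)
occupied = sum(cell for plane in grid for row in plane for cell in row)

if __name__ == "__main__":
    print(f"sdf at origin: {inside}, sdf at x=2: {outside}")
    print(f"occupied voxels at resolution 8: {occupied} of {8 ** 3}")
```

Raising `resolution` refines the grid at cubic memory cost, while the implicit function needs no change, which is the trade-off that hybrid representations aim to balance.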