3D and 4D World Modeling: A Survey

Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu

TL;DR

This survey defines and standardizes 3D/4D world modeling by organizing native representations into VideoGen, OccGen, and LiDARGen, and by detailing a hierarchical taxonomy of data engines, interpreters, simulators, and reconstructors. It surveys datasets, evaluation metrics, and benchmarks, and analyzes applications across autonomous driving, robotics, and digital twins. Key contributions include precise definitions, a structured taxonomy, and comprehensive coverage of datasets and evaluation protocols tailored to 3D/4D settings, along with open challenges and future directions. The work aims to unify the field and provide a foundational reference to advance geometry-grounded, controllable, and scalable world models for embodied AI and simulation.

Abstract

World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/awesome-3d-4d-world-models

Paper Structure

This paper contains 49 sections, 5 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Outline of the survey. This work focuses on native 3D and 4D representations: video streams, occupancy grids, and LiDAR point clouds, guided by geometric ($\mathcal{C}_{\mathrm{geo}}$), action-based ($\mathcal{C}_{\mathrm{act}}$), and semantic ($\mathcal{C}_{\mathrm{sem}}$) conditions (Sec. \ref{sec:pre}). Methods are framed under two paradigms, generative (synthesis from observations and conditions) and predictive (forecasting from history and actions), and grouped into four functional types (Sec. \ref{sec:methods}). We cover three modality tracks and standardize evaluations (Sec. \ref{sec:datasets_evaluations}), practical applications (Sec. \ref{sec:applications}), and future trends (Sec. \ref{sec:challenges_future_directions}) across diverse generation, forecasting, and downstream task perspectives.
  • Figure 2: Summary of representative video-based generation (VideoGen), occupancy-based generation (OccGen), and LiDAR-based generation (LiDARGen) models from existing literature. For the complete list of related methods and discussions on their specifications, configurations, and technical details, kindly refer to Sec. \ref{sec:methods_videogen}, Sec. \ref{sec:methods_occgen}, and Sec. \ref{sec:methods_lidargen}.
  • Figure 3: Summary of existing datasets & benchmarks used for training and evaluating VideoGen, OccGen, and LiDARGen models. For detailed configurations and statistics, kindly refer to Table \ref{tab:comp-dataset}. Images adopted from the original papers.
  • Figure 4: The categorization of VideoGen models based on functionalities, including data engines (Sec. \ref{sec:videogen_data_engine}), action interpreters (Sec. \ref{sec:videogen_action_interpreter}), and neural simulators (Sec. \ref{sec:videogen_neural_simulator}).
  • Figure 5: The categorization of OccGen models based on functionalities, including scene representors (Sec. \ref{sec:occgen_scene_representor}), forecasters (Sec. \ref{sec:occgen_occupancy_forecaster}), and autoregressive simulators (Sec. \ref{sec:occgen_autoregressive_simulator}).
  • ...and 5 more figures