A Survey of World Models for Autonomous Driving

Tuo Feng, Wenguan Wang, Yi Yang

TL;DR

This survey reviews world-model approaches for autonomous driving, organizing the landscape into generation of the future physical world, behavior planning, and the interaction between the two. It emphasizes four generation modalities (image, bird's-eye view (BEV), occupancy grid (OG), and point cloud (PC)) and three planning paradigms (learning-based, rule-based, search-based), along with increasingly integrated prediction-planning loops and training strategies built on self-supervision, large-scale pretraining, and synthetic data. Key contributions include a taxonomy, benchmarks, and a roadmap highlighting diffusion-based generation, multi-modal fusion, advanced simulators, and efficient architectures as central directions for robust, real-world deployment. The findings suggest that closed-loop, controllable, multi-sensor world models paired with rich data generation will significantly enhance the safety, reliability, and scalability of autonomous driving systems, particularly in long-horizon and edge-case scenarios.

Abstract

Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision-making. World models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three-tiered taxonomy: (i) Generation of Future Physical World, covering Image-, BEV-, OG-, and PC-based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule-driven and learning-based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (iii) Interaction between Prediction and Planning, achieving multi-agent collaborative decision-making through latent space diffusion and memory-augmented architectures. The study further analyzes training paradigms, including self-supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self-supervised representation learning, multimodal fusion, and advanced simulation to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.

Paper Structure (36 sections, 1 equation, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Structure of the overall review (§\ref{sec:intro}). The top row outlines the organization. The second and third rows illustrate the background and key components. The fourth row highlights various methodologies for training models in autonomous driving. The bottom row showcases four application areas for world models in autonomous driving.
  • Figure 2: The paradigms of generation methods for future physical world (§\ref{sec:methodologies_future_prediction}). (a) Image-based generation synthesizes high-fidelity 2D images. (b) BEV-based generation forecasts BEV maps using paired image and BEV cues. (c) OG-based generation predicts 4D occupancy grids. (d) PC-based generation outputs future LiDAR sweeps. Boxes indicate processing modules; solid arrows denote data flow.
  • Figure 3: The evolution of interaction (§\ref{sec:methodologies_interaction}). (a) Open-loop regime synthesises scenes from logged data under static conditions; the generator memorises scenarios but never reacts to new actions. (b) Uncontrollable closed-loop regime unifies perception, prediction and planning in an auto-regressive loop, yet latent physics remain fixed, so users cannot inject rules or rare events. (c) Controllable closed-loop regime adds editable 4D worlds, occupancy control and feedback/backward signals, enabling a drive agent to interact safely with a fully testable simulator.
  • Figure 4: (a) Self-supervised world models lift multi-view images into 3D volumes, forecast future grids, and learn from 2D renders plus occupancy cues, reducing labels (§\ref{sec:Self-Supervised_Learning}). (b) Large-scale pre-training on visual point-cloud sequences yields a single backbone that adapts to detection, tracking, mapping, occupancy, and planning (§\ref{sec:data_training_pretraining}).
  • Figure 5: Application areas and tasks (§\ref{sec:application}). (a) Scene understanding fuses multi-camera/LiDAR input into a 4D BEV latent that updates every frame, giving precise geometry. (b) Motion prediction rolls the latent state forward under action cues, outputting collision fields for planners. (c) Simulation turns the generator into an editable 4D world where diffusion seeds rare traffic for safe closed-loop tests. (d) End-to-end driving joins perception, prediction and control in one autoregressive policy.
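
The three interaction regimes described in the Figure 3 caption can be illustrated with a toy rollout. The sketch below is not taken from the survey: `toy_world_model` is a hypothetical one-step kinematic stand-in for a learned world model, and `State`, `keep_gap_policy`, `cut_in`, and all constants are assumptions chosen only to make the contrast visible. Open-loop replay feeds logged actions to the model regardless of what it predicts; closed-loop rollout feeds each predicted state back to a policy; the controllable variant additionally lets a user edit the simulated state to inject a rare event.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

LEAD_SPEED = 8.0  # assumed constant speed of the lead vehicle, m/s
DT = 0.5          # assumed simulation step, s


@dataclass(frozen=True)
class State:
    gap: float    # distance from ego to lead vehicle, m
    speed: float  # ego speed, m/s


def toy_world_model(state: State, accel: float) -> State:
    """Hypothetical stand-in for a learned world model: one kinematic step."""
    speed = max(0.0, state.speed + accel * DT)
    gap = state.gap + (LEAD_SPEED - speed) * DT
    return State(gap=gap, speed=speed)


def open_loop_rollout(model, start: State, logged_actions: Sequence[float]) -> List[State]:
    """Replay logged actions; predictions never influence the next action."""
    states = [start]
    for accel in logged_actions:
        states.append(model(states[-1], accel))
    return states


def closed_loop_rollout(
    model,
    policy: Callable[[State], float],
    start: State,
    horizon: int,
    edit: Optional[Callable[[int, State], State]] = None,
) -> List[State]:
    """Feed each predicted state back to the policy.

    Passing `edit` makes the loop controllable: the caller may alter the
    simulated state at any step, for example to inject a rare event.
    """
    states = [start]
    for t in range(horizon):
        accel = policy(states[-1])
        nxt = model(states[-1], accel)
        if edit is not None:
            nxt = edit(t, nxt)
        states.append(nxt)
    return states


def keep_gap_policy(state: State) -> float:
    """Toy planner: brake when the gap drops below 10 m, else accelerate."""
    return -3.0 if state.gap < 10.0 else 1.0


def cut_in(t: int, state: State) -> State:
    """Toy rare event: another vehicle cuts in at step 2, shrinking the gap."""
    if t == 2:
        return State(gap=state.gap - 4.0, speed=state.speed)
    return state


start = State(gap=12.0, speed=12.0)
logged = [1.0] * 6  # the logged driver kept accelerating

open_states = open_loop_rollout(toy_world_model, start, logged)
closed_states = closed_loop_rollout(toy_world_model, keep_gap_policy, start, 6)
edited_states = closed_loop_rollout(toy_world_model, keep_gap_policy, start, 6, edit=cut_in)

for name, traj in [("open", open_states), ("closed", closed_states), ("edited", edited_states)]:
    print(name, "min gap:", min(s.gap for s in traj))
```

With these toy numbers the open-loop replay drives the gap below zero (a collision the logged actions cannot avoid), while both closed-loop rollouts keep it positive, including after the injected cut-in. A real world model would replace `toy_world_model` with a learned generator over images, BEV maps, occupancy grids, or point clouds.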