Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis

Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz

TL;DR

This survey maps the rising role of foundation models in autonomous driving, centering on scenario generation and scenario analysis. It provides a unified taxonomy across LLMs, VLMs, multimodal LLMs, diffusion models, and world models, and synthesizes datasets, simulators, and benchmarks essential for evaluation. The work identifies open challenges—realism vs. edge-case coverage, multimodal data gaps, standardized metrics, and industrial validation—and outlines concrete future directions to advance FM-driven scenario testing in AD. Collectively, it offers a comprehensive roadmap for integrating foundation models into safety-aware, scalable scenario-based validation pipelines for autonomous vehicles.

Abstract

For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.

Paper Structure

This paper contains 38 sections, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Overview of foundation models applied to scenario generation and analysis for autonomous driving, and the corresponding structure of this survey.
  • Figure 2: Examples of driving scenarios in autonomous driving: datasets and simulations used for scenario-based testing. Sensor data such as camera images, videos, and LiDAR point clouds derived from these scenarios can be used to evaluate perception algorithms. Concurrently, simulator-specific scenario formats support rigorous testing of planning and control modules. Top row (left to right): Waymo Open Motion dataset [ettinger2021large]; Argoverse2 dash camera video [wilson2023argoverse]; NuPlan multi-camera views with map overlays [karnchanachari2024nuplan]. Bottom row (left to right): CommonRoad motion planning scenario [althoff2017commonroad]; CARLA simulated urban scenario [dosovitskiy2017carla]; SUMO large-scale traffic scenario [Lopez2018-sumo].
  • Figure 3: Pre-trained VLMs use both text descriptions and visual inputs for two tasks: (1) scenario generation using text prompts and scene images, and (2) scenario analysis using image understanding and textual reasoning for risk assessment.
  • Figure 4: Overview of adaptation techniques for MLLMs in autonomous driving. Encoders extract features from modality-specific inputs. Projectors are trainable modules that map features into the LLM's embedding space to enable cross-modal alignment. The LLM serves as the reasoning core and, depending on the available resources and the task, can be kept frozen or trained with fine-tuning techniques.
  • Figure 5: An illustration of how a DM transforms a clean image into noise through the forward process, and then reconstructs it in reverse during the backward process.
  • ...and 2 more figures