DiffAD: A Unified Diffusion Modeling Approach for Autonomous Driving

Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, Chang Huang

TL;DR

DiffAD reframes end-to-end autonomous driving as conditional image generation on a unified rasterized BEV, addressing coordination and complexity issues in prior modular and E2E pipelines. It introduces a latent diffusion framework with AdaLN-conditioned denoising, a three-canvas BEV representation, and a Trajectory Extraction Network to jointly learn perception, prediction, and planning. The approach yields state-of-the-art closed-loop performance on CARLA Bench2Drive, with ablations confirming the benefits of joint task optimization, denoising progression, and modality fusion. This work demonstrates the potential of diffusion-based generative modeling to simplify autonomous driving architectures while improving robustness and planning coherence.

Abstract

End-to-end autonomous driving (E2E-AD) has rapidly emerged as a promising approach toward achieving full autonomy. However, existing E2E-AD systems typically adopt a traditional multi-task framework, addressing perception, prediction, and planning tasks through separate task-specific heads. Despite being trained in a fully differentiable manner, they still encounter issues with task coordination, and the system complexity remains high. In this work, we introduce DiffAD, a novel diffusion probabilistic model that redefines autonomous driving as a conditional image generation task. By rasterizing heterogeneous targets onto a unified bird's-eye view (BEV) and modeling their latent distribution, DiffAD unifies various driving objectives and jointly optimizes all driving tasks in a single framework, significantly reducing system complexity and harmonizing task coordination. The reverse process iteratively refines the generated BEV image, resulting in more robust and realistic driving behaviors. Closed-loop evaluations in CARLA demonstrate the superiority of the proposed method, achieving a new state-of-the-art Success Rate and Driving Score.
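
To make the idea of rasterizing targets onto a BEV image concrete, the following is a minimal sketch, not the authors' implementation. It draws one ego-centred polyline (for example a planned trajectory) onto a binary grid. The function name `rasterize_polyline`, the grid size, the resolution, and the axis convention (ego at the image centre, +y pointing up in the image) are all assumptions made for illustration.

```python
from typing import List, Tuple


def rasterize_polyline(
    points: List[Tuple[float, float]],
    size: int = 32,
    meters_per_pixel: float = 1.0,
) -> List[List[int]]:
    """Draw ego-centred (x, y) waypoints, in metres, onto a size x size grid.

    Assumed convention (hypothetical): the ego vehicle sits at the image
    centre, +x maps to increasing column, +y maps to decreasing row.
    """
    canvas = [[0] * size for _ in range(size)]
    half = size // 2

    def to_pixel(x: float, y: float) -> Tuple[int, int]:
        col = int(round(x / meters_per_pixel)) + half
        row = half - int(round(y / meters_per_pixel))
        return row, col

    # Connect consecutive waypoints with a simple interpolated line.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        r0, c0 = to_pixel(x0, y0)
        r1, c1 = to_pixel(x1, y1)
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for i in range(steps + 1):
            r = r0 + round((r1 - r0) * i / steps)
            c = c0 + round((c1 - c0) * i / steps)
            if 0 <= r < size and 0 <= c < size:  # drop out-of-range cells
                canvas[r][c] = 1
    return canvas


# Usage: a straight 5 m path ahead of the ego vehicle.
canvas = rasterize_polyline([(0.0, 0.0), (0.0, 5.0)])
```

Under this sketch, heterogeneous targets (lanes, agents, trajectories) could each be drawn into their own channel of the same grid, which is one plausible way to obtain a single image that a generative model can treat uniformly.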

Paper Structure

This paper contains 37 sections, 12 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Paradigm overview. (a) Module-E2E adopts sequential pipelines where multi-task heads are optimized in a differentiable manner. (b) DiffAD (ours) integrates all components into a single denoising head and treats E2E-AD as a conditional image generation task, resulting in a fully end-to-end joint optimization of all driving tasks.
  • Figure 2: Pipeline of DiffAD. Training Process: (a) DiffAD rasterizes perception, prediction, and planning targets onto a BEV image, which is encoded into a latent space $\mathbf{x_0}$ using a VAE. (b) Surrounding images are transformed into BEV features. (c) A diffusion model predicts noise $\mathbf{\epsilon_\theta}$ from the noisy latent BEV image, and (d) a trajectory extraction network (TEN) learns to recover the ego trajectory from the latent BEV image. Inference Process: (1) DiffAD generates a denoised latent BEV image $\mathbf{\hat{x}_0}$ from pure Gaussian noise $\mathbf{x}_T$, conditioned on BEV features, (2) extracts the ego trajectory via TEN, and (3) decodes the latent BEV image for interpretation.
  • Figure 3: Status Distribution.
  • Figure 4: Demonstration of the Model's Multi-Modal Decision-Making
  • Figure 5: Comparison of different DiT configs.
  • ...and 1 more figures
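
The training step named in the Figure 2 caption (a model predicts the noise in a noisy latent) can be sketched as follows, assuming the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. This is an illustration only: the linear schedule, its endpoints, and every function name are assumptions, the latent is a flat list of floats rather than a VAE-encoded BEV image, and the BEV-feature conditioning is omitted.

```python
import math
import random
from typing import List, Tuple


def linear_beta_schedule(T: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Assumed linear noise schedule with T steps (values are illustrative)."""
    denom = max(T - 1, 1)
    return [beta_start + (beta_end - beta_start) * t / denom for t in range(T)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0: List[float], t: int, abar: List[float],
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Forward process: return the noisy latent x_t and the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [a * x + s * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_mse(eps_pred: List[float], eps: List[float]) -> float:
    """Noise-prediction loss: mean squared error between two noise vectors."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


# Usage: noise a toy latent at step t; a denoiser would be trained so that
# its prediction minimises noise_mse(eps_pred, eps).
abar = alpha_bars(linear_beta_schedule(1000))
xt, eps = q_sample([0.5, -1.0, 2.0], t=500, abar=abar, rng=random.Random(0))
```

At inference the process runs in reverse: starting from Gaussian noise $\mathbf{x}_T$, the learned noise estimate is removed step by step to reach $\mathbf{\hat{x}_0}$, as the caption describes.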