GenAD: Generative End-to-End Autonomous Driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, Long Chen

TL;DR

GenAD reframes end-to-end autonomous driving as a trajectory-generation task in a structured latent space, enabling joint motion prediction and planning while modeling high-order ego–agent interactions through an instance-centric scene representation. A variational autoencoder establishes a trajectory prior in latent space, and a GRU-based generator decodes futures over time, enabling efficient, probabilistic planning conditioned on scene tokens. The approach yields state-of-the-art planning performance on nuScenes with competitive inference speed, and ablations confirm the value of the instance-centric representation and the generative prior. This framework advances vision-based driving by explicitly modeling the uncertainty and structure of future trajectories rather than relying on serial perception-prediction-planning pipelines.

Abstract

Directly producing planning results from raw sensors has been a long-desired solution for autonomous driving and has attracted increasing attention recently. Most existing end-to-end autonomous driving methods factorize this problem into perception, motion prediction, and planning. However, we argue that the conventional progressive pipeline still cannot comprehensively model the entire traffic evolution process, e.g., the future interaction between the ego car and other traffic participants and the structural trajectory prior. In this paper, we explore a new paradigm for end-to-end autonomous driving, where the key is to predict how the ego car and the surroundings evolve given past scenes. We propose GenAD, a generative framework that casts autonomous driving into a generative modeling problem. We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens. We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling. We further adopt a temporal model to capture the agent and ego movements in the latent space to generate more effective future trajectories. GenAD finally simultaneously performs motion prediction and planning by sampling distributions in the learned structural latent space conditioned on the instance tokens and using the learned temporal model to generate futures. Extensive experiments on the widely used nuScenes benchmark show that the proposed GenAD achieves state-of-the-art performance on vision-centric end-to-end autonomous driving with high efficiency. Code: https://github.com/wzzheng/GenAD.
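The abstract's trajectory-prior step can be illustrated with a minimal NumPy sketch. It assumes diagonal Gaussian latents, a reparameterized sample, and a KL term that pulls a scene-conditioned distribution toward the one encoded from ground-truth futures; the abstract does not give the exact loss, so these choices are assumptions. The sizes, variable names, and random stand-ins for network outputs are all hypothetical.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps from a diagonal Gaussian."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent axis."""
    var_q = np.exp(log_var_q)
    var_p = np.exp(log_var_p)
    kl = 0.5 * (log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


rng = np.random.default_rng(0)
num_instances, latent_dim = 4, 8  # hypothetical sizes (agents + ego, latent width)

# Stand-ins for network outputs (assumption: each head predicts mean and log-variance).
mu_scene = rng.standard_normal((num_instances, latent_dim))      # from instance tokens
log_var_scene = np.zeros((num_instances, latent_dim))
mu_future = mu_scene + 0.1                                       # from ground-truth futures
log_var_future = np.full((num_instances, latent_dim), -1.0)

# At inference only the scene-conditioned distribution is available, so we sample from it.
z = sample_latent(mu_scene, log_var_scene, rng)

# During training, a KL term can align the two distributions (one value per instance).
kl = kl_diag_gaussians(mu_future, log_var_future, mu_scene, log_var_scene)
```

Sampling several `z` per instance would give several candidate futures, which is one way to read the probabilistic planning described in the TL;DR.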

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparisons of the proposed generative end-to-end autonomous driving framework with the conventional pipeline. Most existing methods follow a serial design of perception, prediction, and planning. They usually ignore the high-level interactions between the ego car and other agents and the structural prior of realistic trajectories. We model autonomous driving as a future generation problem and conduct motion prediction and ego planning simultaneously in a structural latent trajectory space.
  • Figure 2: Framework of our generative end-to-end autonomous driving. Given surrounding images as inputs, we employ an image backbone to extract multi-scale features and then use a BEV encoder to obtain BEV tokens. We then use cross-attention and deformable cross-attention to transform BEV tokens into map and agent tokens, respectively. With an additional ego token, we use self-attention to enable ego-agent interactions and cross-attention to further incorporate map information to obtain the instance-centric scene representation. We map this representation to a structural latent trajectory space which is jointly learned using ground-truth future trajectories. Finally, we employ a future trajectory generator to produce future trajectories to simultaneously complete motion prediction and planning.
  • Figure 3: Illustration of the proposed trajectory prior modeling and future generation. We use a future trajectory encoder to map ground-truth trajectories to a latent trajectory space, where we use a Gaussian distribution to model the trajectory uncertainty. We then employ a gated recurrent unit (GRU) to progressively predict the next future state in the latent space and use a decoder to obtain explicit trajectories.
  • Figure 4: Visualizations of GenAD results compared with VAD. We provide perception, motion prediction, and planning results with surrounding camera inputs.
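
The generation step described in the Figure 3 caption can be sketched as a GRU rollout in latent space followed by a decoder. This is an illustrative NumPy sketch, not the authors' implementation. It makes several assumptions: a standard GRU cell, the sampled latent used as the initial hidden state, the instance token used as the input at every step, and a linear decoder that emits per-step 2D offsets which are accumulated into waypoints. All names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_cell(x, h, params):
    """One standard GRU step with update gate z, reset gate r, and candidate n."""
    Wz, Uz, bz, Wr, Ur, br, Wn, Un, bn = params
    z = sigmoid(x @ Wz + h @ Uz + bz)
    r = sigmoid(x @ Wr + h @ Ur + br)
    n = np.tanh(x @ Wn + (r * h) @ Un + bn)
    return (1.0 - z) * n + z * h


def rollout(z0, tokens, params, W_dec, steps):
    """Evolve the latent state step by step and decode one 2D offset per step."""
    h = z0  # assumption: the sampled latent initializes the hidden state
    offsets = []
    for _ in range(steps):
        h = gru_cell(tokens, h, params)  # next latent state
        offsets.append(h @ W_dec)        # linear decoder to (dx, dy)
    offsets = np.stack(offsets, axis=1)  # (num_instances, steps, 2)
    return np.cumsum(offsets, axis=1)    # accumulate offsets into waypoints


rng = np.random.default_rng(0)
num_instances, token_dim, latent_dim, steps = 3, 16, 8, 6  # hypothetical sizes


def w(*shape):
    """Small random weights standing in for learned parameters."""
    return 0.1 * rng.standard_normal(shape)


# Parameter order: (Wz, Uz, bz, Wr, Ur, br, Wn, Un, bn).
params = tuple(
    p
    for _ in range(3)
    for p in (w(token_dim, latent_dim), w(latent_dim, latent_dim), np.zeros(latent_dim))
)
W_dec = w(latent_dim, 2)

tokens = rng.standard_normal((num_instances, token_dim))  # instance tokens (ego + agents)
z0 = rng.standard_normal((num_instances, latent_dim))     # sampled latent per instance
trajectories = rollout(z0, tokens, params, W_dec, steps)
```

Because the ego token and the agent tokens go through the same rollout, a single pass produces both the plan and the motion predictions, which matches the simultaneous prediction and planning described in the captions.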