DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang

TL;DR

DiffusionDriveV2 tackles mode collapse in diffusion-based end-to-end autonomous driving by applying reinforcement-learning constraints to all modes and by using scale-adaptive exploration. Intra-Anchor GRPO performs group advantage estimation within each anchor, Inter-Anchor Truncated GRPO maintains global learning signals across anchors, and a scale-adaptive multiplicative noise scheme drives exploration. A two-stage mode selector with a Margin-Rank loss then picks the most goal-aligned trajectory from the multi-modal predictions. On NAVSIM v1 and v2 with a ResNet-34 backbone, DiffusionDriveV2 reports state-of-the-art scores (91.2 PDMS and 85.5 EPDMS, respectively) and a better balance between trajectory diversity and consistently high quality.

Abstract

Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
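The abstract's point about "improper advantage comparisons between distinct intentions" can be illustrated with a small sketch. This is not the authors' implementation: the reward values, the anchor names, and the helper functions `intra_anchor_advantages` and `pooled_advantages` are hypothetical, and the truncation rule of inter-anchor truncated GRPO is left out because this page does not describe it. The sketch only contrasts standard GRPO-style normalization (reward minus group mean, divided by group standard deviation) computed per anchor with the same normalization computed over all anchors pooled together.

```python
from statistics import mean, pstdev


def intra_anchor_advantages(rewards_by_anchor, eps=1e-6):
    """Normalize each reward against the samples drawn from the SAME anchor."""
    advantages = {}
    for anchor, rewards in rewards_by_anchor.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages[anchor] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


def pooled_advantages(rewards_by_anchor, eps=1e-6):
    """Normalize each reward against ALL samples, ignoring which anchor produced it."""
    all_rewards = [r for rewards in rewards_by_anchor.values() for r in rewards]
    mu = mean(all_rewards)
    sigma = pstdev(all_rewards)
    return {
        anchor: [(r - mu) / (sigma + eps) for r in rewards]
        for anchor, rewards in rewards_by_anchor.items()
    }


# Hypothetical per-sample rewards for two driving intentions (assumed values).
rewards = {
    "go_straight": [0.90, 0.95, 0.85],
    "right_turn": [0.40, 0.60, 0.50],
}

intra = intra_anchor_advantages(rewards)
pooled = pooled_advantages(rewards)

# The best right-turn sample (reward 0.60) is rewarded within its own anchor,
# but penalized when compared against the go-straight samples.
print(intra["right_turn"][1] > 0, pooled["right_turn"][1] < 0)
```

With pooled normalization every right-turn sample in this toy example receives a negative advantage, so the whole intention would be pushed down, which is the kind of mode collapse the abstract warns about. Per-anchor normalization keeps the comparison among samples that share an intention.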

Paper Structure

This paper contains 37 sections, 15 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison of various models. (a) Vanilla Diffusion models are prone to mode collapse, collapsing diverse possibilities into a single trajectory. (b) DiffusionDrive generates trajectories with excellent multimodality, yet constrained by imitation learning, it also produces numerous colliding ones (circled in red) as most negative modes lack supervision during training, posing a major threat to the system's overall quality. (c) DiffusionDriveV2 leverages reinforcement learning to apply constraints to multi-modal trajectories, guiding the model to generate both diverse and consistent high-quality trajectories.
  • Figure 2: Overall architecture of DiffusionDriveV2. Trajectories of different colors represent distinct anchored intents. Solid lines indicate high-quality trajectories, while dashed lines indicate low-quality ones. The truncated diffusion decoder, limited by incomplete supervision in IL, produces low-quality trajectories (overtake, right turn) alongside high-quality ones (go straight). To address this, we first apply multiplicative Gaussian noise to push the model to explore the nearby action space. We then propose Anchored Truncated GRPO, which performs intra-group advantage estimation to optimize the model, steering it away from collisions and towards high-quality trajectories. The resulting refined trajectories for overtake and right turn become collision-free, while the go straight trajectories become more optimal rather than overly conservative. Finally, a mode selector chooses the most goal-aligned trajectory from the refined trajectories.
  • Figure 3: Comparison with Different Noise Strategies for Exploration. The green solid line denotes the original trajectory, while the blue and red dashed lines represent the trajectories after applying exploration noise.
  • Figure 4: Qualitative comparison of Vanilla Diffusion, DiffusionDrive, and DiffusionDriveV2 on going straight scenarios of NAVSIM navtest split.
  • Figure 5: Qualitative comparison of Vanilla Diffusion, DiffusionDrive, and DiffusionDriveV2 on going straight scenarios of NAVSIM navtest split.
  • ...and 3 more figures
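The noise comparison in Figure 3 and the "scale-adaptive multiplicative noise" named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the toy trajectory, the noise scale `sigma`, and the choice of one scale factor per axis shared by all waypoints are all assumptions, and `additive_noise` and `multiplicative_noise` are hypothetical helper names.

```python
import random


def additive_noise(traj, sigma, rng):
    """Perturb every waypoint independently with the same absolute noise scale."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in traj]


def multiplicative_noise(traj, sigma, rng):
    """Scale the whole trajectory by (1 + eps) per axis.

    Assumption: one factor per axis is shared by all waypoints, so the
    perturbation grows with the waypoint's magnitude and the shape stays smooth.
    """
    scale_x = 1.0 + rng.gauss(0.0, sigma)
    scale_y = 1.0 + rng.gauss(0.0, sigma)
    return [(x * scale_x, y * scale_y) for x, y in traj]


rng = random.Random(0)

# Toy trajectory in the ego frame: waypoints move away from the origin over time.
traj = [(0.5 * t, 0.02 * t * t) for t in range(1, 9)]

noisy_add = additive_noise(traj, sigma=0.1, rng=rng)
noisy_mul = multiplicative_noise(traj, sigma=0.1, rng=rng)

# Absolute displacement of the first and last waypoint under multiplicative noise.
first_shift = abs(noisy_mul[0][0] - traj[0][0])
last_shift = abs(noisy_mul[-1][0] - traj[-1][0])
print(first_shift <= last_shift)
```

In this sketch additive noise displaces near and far waypoints by the same absolute amount, which can make the short-range part of a trajectory jagged. The multiplicative variant perturbs each waypoint in proportion to its distance from the origin, which is one reading of "scale-adaptive".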