Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang

TL;DR

The paper tackles safety challenges in End-to-End Vision-Language-Action autonomous driving by introducing ReflectDrive, a discrete diffusion-based planner that discretizes the 2D driving space into an action codebook and uses a reflection mechanism to enforce safety without gradient-based optimization. It combines goal-conditioned generation with a gradient-free Safety-Guided Regeneration loop, using the scoring functions $S_{global}$, $S_{safe}$, and $S_{local}$ to iteratively refine trajectories via inpainting. Evaluated on the NAVSIM benchmark, ReflectDrive achieves near-human performance on safety-critical metrics, with substantial gains over baseline E2E planners, and shows strong improvements in the setting that uses ground-truth agent states. The work demonstrates that discrete diffusion coupled with safety-centered reflective inference can provide scalable, interpretable, and reliable planning for autonomous driving, potentially reducing reliance on post-hoc rule-based refinements.
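
To make the idea of an action codebook concrete, here is a minimal sketch of one way a 2D trajectory could be turned into discrete tokens. It assumes uniform per-axis binning with one token per coordinate; the range `LO`/`HI`, the bin count `NUM_BINS`, and both function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical grid parameters (assumptions, not the paper's values).
LO, HI = -10.0, 10.0              # covered coordinate range per axis
NUM_BINS = 128                    # codebook size per axis
BIN_WIDTH = (HI - LO) / NUM_BINS


def trajectory_to_tokens(traj_xy: Sequence[Tuple[float, float]]) -> List[int]:
    """Quantize (x, y) waypoints into an interleaved token list [x0, y0, x1, y1, ...]."""
    tokens: List[int] = []
    for x, y in traj_xy:
        for value in (x, y):
            clipped = min(max(value, LO), HI)            # clamp to the grid
            index = int(math.floor((clipped - LO) / BIN_WIDTH))
            tokens.append(min(index, NUM_BINS - 1))      # HI falls in the last bin
    return tokens


def tokens_to_trajectory(tokens: Sequence[int]) -> List[Tuple[float, float]]:
    """Map tokens back to bin-center coordinates, pairing them as (x, y)."""
    centers = [LO + (t + 0.5) * BIN_WIDTH for t in tokens]
    return list(zip(centers[0::2], centers[1::2]))
```

With this scheme the round-trip error per coordinate is at most half a bin width, so the resolution of the planner is set directly by the codebook size.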

Abstract

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
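
The reflection loop described in the abstract (find unsafe tokens, search locally for a feasible replacement, then regenerate around that safe anchor) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: tokens are grid cells, `is_safe` is a hypothetical predicate standing in for the paper's safety scoring, and `inpaint_stub` replaces the diffusion model's inpainting with plain linear interpolation. It shows only the gradient-free control flow, not the authors' implementation.

```python
from typing import Callable, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # one trajectory token as an (x_bin, y_bin) grid cell


def local_search(cell: Cell, is_safe: Callable[[Cell], bool], radius: int) -> Optional[Cell]:
    """Return the safe cell closest in Manhattan distance within a square window, or None."""
    best, best_dist = None, None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            cand = (cell[0] + dx, cell[1] + dy)
            dist = abs(dx) + abs(dy)
            if is_safe(cand) and (best_dist is None or dist < best_dist):
                best, best_dist = cand, dist
    return best


def _lerp(a: Cell, b: Cell, t: float) -> Cell:
    return (round(a[0] + (b[0] - a[0]) * t), round(a[1] + (b[1] - a[1]) * t))


def inpaint_stub(traj: List[Cell], idx: int, window: int, anchors: Set[int]) -> List[Cell]:
    """Stand-in for model inpainting: re-fill non-anchor neighbours of traj[idx]."""
    out = list(traj)
    left, right = max(idx - window, 0), min(idx + window, len(out) - 1)
    for i in range(left + 1, idx):
        if i not in anchors:
            out[i] = _lerp(traj[left], traj[idx], (i - left) / (idx - left))
    for i in range(idx + 1, right):
        if i not in anchors:
            out[i] = _lerp(traj[idx], traj[right], (i - idx) / (right - idx))
    return out


def reflect(traj, is_safe, max_iters=10, radius=3, window=2):
    """Iteratively repair unsafe tokens; returns (trajectory, all_tokens_safe)."""
    traj, anchors = list(traj), set()
    for _ in range(max_iters):
        unsafe = [i for i, c in enumerate(traj) if not is_safe(c)]
        if not unsafe:
            return traj, True
        i = unsafe[0]
        anchor = local_search(traj[i], is_safe, radius)
        if anchor is None:          # no feasible cell nearby: give up
            return traj, False
        traj[i] = anchor            # fix the safe anchor in place
        anchors.add(i)
        traj = inpaint_stub(traj, i, window, anchors)
    return traj, all(is_safe(c) for c in traj)
```

Keeping earlier anchors fixed during regeneration is a design choice of this sketch: each iteration pins one more token to a safe cell, so the loop cannot undo its own corrections.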

Paper Structure

This paper contains 38 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: ReflectDrive Framework Overview.
  • Figure 2: Safety-Guided Regeneration Pipeline.
  • Figure 3: Safety-Guided Regeneration (S.G.R.) Visualization. The first row illustrates three scenarios where large-angle turns are prone to boundary violations. The initial trajectories (lightest color) risk exceeding the boundaries. With S.G.R., the trajectory is gradually optimized toward the safe region (its color darkening progressively), ultimately resulting in a feasible trajectory. The second row depicts three scenarios involving intense interactions. Initial trajectories may pose collision risks with other vehicles or pedestrians. Through the iterative optimization of S.G.R., the trajectories learn to avoid conflicts or decelerate to yield, achieving much higher safety.
  • Figure 4: Ablation on (a) the number of generation steps for ReflectDrive (w/o R.I.), (b) the number of goal points for Goal-Conditioned Generation (G.C.G.), and (c) the number of exploration steps and the maximum number of iterations for Safety-Guided Regeneration (S.G.R.).
  • Figure 5: Planning results that meet the PDM evaluation criteria.
  • ...and 1 more figure