Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Pengxiang Li; Yinan Zheng; Yue Wang; Huimin Wang; Hang Zhao; Jingjing Liu; Xianyuan Zhan; Kun Zhan; Xianpeng Lang

TL;DR

The paper tackles safety challenges in end-to-end (E2E) Vision-Language-Action (VLA) autonomous driving by introducing ReflectDrive, a discrete-diffusion planner that discretizes the 2D driving space into an action codebook and uses a reflection mechanism to enforce safety without gradient-based optimization. It combines goal-conditioned generation with a gradient-free Safety-Guided Regeneration loop that uses the scoring functions $S_{global}$, $S_{safe}$, and $S_{local}$ to iteratively refine trajectories via inpainting. Evaluated on the NAVSIM benchmark, ReflectDrive achieves near-human performance on safety-critical metrics, with substantial gains over baseline E2E planners, and shows strong improvements when evaluated with ground-truth agent states. The work suggests that discrete diffusion coupled with safety-centered reflective inference can provide scalable, interpretable, and reliable planning for autonomous driving, potentially reducing reliance on post-hoc rule-based refinements.

Abstract

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
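The two ideas the abstract leans on, an action codebook over the 2D driving space and a gradient-free local search for a safe anchor, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the uniform grid, its extents and bin count (`X_MIN`, `X_MAX`, `Y_MIN`, `Y_MAX`, `N_BINS`), the helper names `encode`, `decode`, and `nearest_safe_token`, and the toy `is_safe` predicate, which stands in for the paper's scoring functions. The diffusion model and the inpainting-based regeneration step are not modeled.

```python
import numpy as np

# Hypothetical grid over the ego-centric driving plane, in metres.
# These values are illustrative assumptions, not the paper's settings.
X_MIN, X_MAX = -10.0, 50.0
Y_MIN, Y_MAX = -20.0, 20.0
N_BINS = 64  # bins per axis, so the codebook holds N_BINS * N_BINS tokens


def encode(waypoints):
    """Quantize (N, 2) waypoints to (N,) integer codebook token ids."""
    pts = np.asarray(waypoints, dtype=float)
    ix = np.floor((pts[:, 0] - X_MIN) / (X_MAX - X_MIN) * N_BINS).astype(int)
    iy = np.floor((pts[:, 1] - Y_MIN) / (Y_MAX - Y_MIN) * N_BINS).astype(int)
    # Points outside the grid are clamped to the nearest border cell.
    ix = np.clip(ix, 0, N_BINS - 1)
    iy = np.clip(iy, 0, N_BINS - 1)
    return ix * N_BINS + iy


def decode(tokens):
    """Map token ids back to the (N, 2) centres of their grid cells."""
    ix, iy = np.divmod(np.asarray(tokens), N_BINS)
    x = X_MIN + (ix + 0.5) * (X_MAX - X_MIN) / N_BINS
    y = Y_MIN + (iy + 0.5) * (Y_MAX - Y_MIN) / N_BINS
    return np.stack([x, y], axis=1)


def nearest_safe_token(token, is_safe, radius=2):
    """Gradient-free local search around one token.

    Scans the grid cells within `radius` of the token and returns the
    closest cell (in grid distance) whose centre satisfies
    `is_safe(x, y)`, or None if no such cell exists in the window.
    """
    ix, iy = divmod(int(token), N_BINS)
    best, best_dist = None, None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            jx, jy = ix + dx, iy + dy
            if not (0 <= jx < N_BINS and 0 <= jy < N_BINS):
                continue  # neighbour falls outside the codebook grid
            candidate = jx * N_BINS + jy
            x, y = decode([candidate])[0]
            if is_safe(x, y):
                dist = dx * dx + dy * dy
                if best is None or dist < best_dist:
                    best, best_dist = candidate, dist
    return best
```

In a reflective loop of the kind the abstract describes, a token flagged as unsafe would be replaced by the result of such a search, and that replacement would then be held fixed as an anchor while the remaining tokens are regenerated. Because the search only evaluates a predicate on discrete candidates, no gradient of the safety criterion is needed.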
