dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao

TL;DR

This paper addresses end-to-end autonomous driving under distribution shift by replacing autoregressive vision-language models with dVLM-AD, a diffusion-based model that unifies perception, reasoning, and planning through bidirectional denoising and template-anchored controlled decoding. It introduces a dynamic denoising strategy and a two-stage training regimen (145k driving QA alignments plus 23k/30k structured annotations) to strengthen reasoning–action consistency while keeping planning performance competitive on nuScenes and Waymo Open Dataset End-to-End (WOD-E2E). Using textual waypoints and a relatively compact LLaDA-V backbone, the approach is more consistent and more robust to prompt perturbations than AR baselines, and it outperforms them in long-tail driving scenarios. Overall, the results suggest that diffusion-based VLMs offer a scalable, reliable pathway toward safe and interpretable end-to-end driving with controllable reasoning.

Abstract

The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

Paper Structure

This paper contains 29 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of end-to-end autonomous driving paradigms. (a) Autoregressive VLMs sequentially decode reasoning and textual action tokens; each prediction depends on previous outputs, which leads to accumulated exposure bias and limited global consistency. (b) VLAs introduce latent action tokens and a separate decoder to produce trajectories, but reasoning–action coupling remains implicit. (c) Our dVLM reformulates driving as an iterative denoising process that jointly refines reasoning and action representations under sensor and ego-state conditioning. This diffusion formulation eliminates left-to-right dependencies, enhances stability, and achieves stronger reasoning–action alignment within the end-to-end autonomous driving system.
  • Figure 2: Challenges in existing driving VLMs/VLAs. (1) Reasoning–action inconsistency: the sequential decoding of AR-based VLMs can produce plans in which the inferred behavior conflicts with the predicted trajectory. Our dVLM performs template-anchored iterative refinement with bidirectional attention, enforcing cross-field consistency between meta-behavior and trajectory. (2) Uncontrollable generation: the structured reasoning of AR-based VLMs is easily corrupted by prompt-level perturbations (e.g., instructions to bypass reasoning steps), leading to broken formats and unstable outputs. In contrast, dVLM’s template-anchored fill-in-the-blank decoding with schema checks preserves order and semantics, yielding safe, consistent actions.
  • Figure 3: Overview of the dVLM-AD framework.
  • Figure 4: Dynamic denoising strategy for controllable reasoning.
  • Figure 5: Examples in which dVLM-AD shows stronger consistency between reasoning and action than autoregressive VLMs.
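
The template-anchored fill-in-the-blank decoding mentioned in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's implementation: the template, the `WAYPOINTS` table, `toy_scorer`, and `denoise` are all hypothetical names invented here, and the scorer is a hand-written stand-in for a learned bidirectional denoiser. The sketch only shows the general idea of keeping anchor tokens fixed while masked slots are committed in order of confidence, so that later slots can condition on fields filled earlier.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# Hypothetical two-field template: anchor tokens stay fixed, MASK slots get filled.
TEMPLATE = ["behavior:", MASK, "trajectory:", MASK, MASK]

# Hypothetical lookup tying each meta-behavior to two textual waypoints.
WAYPOINTS = {
    "stop": ["(0.0,0.0)", "(0.0,0.0)"],
    "go_straight": ["(0.0,2.0)", "(0.0,4.0)"],
}

Scorer = Callable[[List[str], int], Tuple[str, float]]


def toy_scorer(tokens: List[str], pos: int) -> Tuple[str, float]:
    """Stand-in for a denoiser that sees the whole partially filled sequence.

    Returns a (token, confidence) proposal for the masked position `pos`.
    """
    behavior = tokens[1]
    if pos == 1:
        # Behavior slot: this toy always proposes "stop" with moderate confidence.
        return "stop", 0.6
    if behavior == MASK:
        # Trajectory slots are uncertain while the behavior field is still masked.
        return "(0.0,0.0)", 0.3
    # Once the behavior is visible, propose waypoints consistent with it.
    return WAYPOINTS[behavior][pos - 3], 0.9


def denoise(template: List[str], scorer: Scorer, per_step: int = 1) -> List[str]:
    """Iteratively fill masked slots, committing the most confident ones first."""
    tokens = list(template)
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        proposals = {i: scorer(tokens, i) for i in masked}
        # Commit the top `per_step` proposals; the remaining slots stay masked.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            tokens[i] = proposals[i][0]
    return tokens


result = denoise(TEMPLATE, toy_scorer)
```

In this toy run the behavior slot is committed first because its proposal has the highest confidence, and the trajectory slots are then filled conditioned on it. Because only masked positions are ever written, the anchor tokens and field order cannot be altered during decoding.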