VERDI: VLM-Embedded Reasoning for Autonomous Driving

Bowen Feng, Zhiting Mei, Baiang Li, Julian Ost, Filippo Ghilotti, Roger Girgis, Anirudha Majumdar, Felix Heide

TL;DR

VERDI addresses decision making under partial observability in autonomous driving by distilling the reasoning of large Vision-Language Models (VLMs) into a lightweight, modular end-to-end (e2e) driving stack. It prompts a VLM to generate reasoning for perception, prediction, and planning, encodes those responses into latent language embeddings, and aligns them with the corresponding e2e module representations through Progressive Feature Projectors. This training-time distillation yields open-loop and closed-loop improvements on nuScenes, Bench2Drive, and HugSim, while preserving real-time inference because the VLM is never queried at runtime. The approach reduces reliance on slow VLMs and integrates commonsense reasoning into a driving pipeline that remains modular. Ablation results confirm the contribution of each aligned module and the impact of VLM embedding quality on performance.

Abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (running inference with a 70B-parameter VLM at merely 8 tokens per second requires more than 160 GB of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features, produced by VLMs, that explain the driving reasoning process. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning without incurring the inference-time costs of large VLMs. We validate VERDI in both open-loop (nuScenes and Bench2Drive benchmarks) and closed-loop (HugSim simulator) settings. We find that VERDI outperforms existing e2e methods that do not embed reasoning by up to 11% in $\ell_{2}$ distance and 11% in driving performance, while maintaining real-time inference speed.
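
As a rough sanity check on the memory figure quoted above, the following sketch computes the footprint of the model weights alone. It assumes 16-bit weights, which the abstract does not state, and the helper name `weight_memory_gb` is hypothetical. Under that assumption the weights of a 70B-parameter model take about 140 GB, so a total above 160 GB is plausible once activations and the key-value cache are added.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS_70B = 70e9

# Assumption: 16-bit (2-byte) weights; the source does not give the precision.
fp16_gb = weight_memory_gb(PARAMS_70B, 2)
# For comparison: 32-bit (4-byte) weights.
fp32_gb = weight_memory_gb(PARAMS_70B, 4)

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"32-bit weights: {fp32_gb:.0f} GB")
```

The estimate deliberately ignores runtime buffers, so it is a lower bound on deployment memory rather than a reproduction of the quoted number.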

Paper Structure

This paper contains 21 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of VERDI. Our pipeline aligns the VLM reasoning module with our e2e driving model. During training, the ground truth (GT) trajectory and observed images are provided to the VLM so that it can explain the reasoning behind perception, prediction, and planning during the driving process. The VLM's answers for each submodule are aligned with the corresponding submodule outputs from the e2e driving model, effectively distilling VLM knowledge and reasoning into the e2e model. At inference time, the e2e model plans the future trajectory with the embedded reasoning process, without having to query the VLM (pink arrow).
  • Figure 2: Obtaining description features through chain-of-thought prompting and text encoder. For each query, the prompt consists of the system prompt, the observed images, the ego vehicle trajectory, the respective question, as well as the answers to the upstream modules (if any). The VLM answers to each module are encoded and mapped to a latent feature space.
  • Figure 3: VERDI Training. The e2e model is trained with VERDI for the individual perception, prediction, and planning modules. All relevant feature maps $F$ and $Q$ are first mapped to a feature $f_{\mathcal{P}}$ in a representation space that is shared with the encoded language features $f_{\mathcal{M}}$. This mapping is performed by VERDI's trainable $\text{PFP}$ layers. The perception outputs $F_{\texttt{perception}}$, including the extracted image features, are directly supervised with the encoded VLM features. In the subsequent modules, all features are supervised. $\mathcal{L}_f$ computes the similarity between $f_{\mathcal{P}}$ and $f_{\mathcal{M}}$.
  • Figure 4: Qualitative comparison of VERDI (Ours, right column) and the Supervised e2e model (baseline, left column) on the nuScenes dataset (Caesar et al., 2020). Each entry shows the multi-view camera observations on the left and the BEV view on the right at one time step $t$. The left panel overlays the ego agent’s planned 3-second trajectory on the front-camera image and BEV panel as a solid green line that fades to blue. The BEV panel renders the ego vehicle as a green rectangle, pedestrians and other vehicles as red rectangles, and their predicted 3-second trajectories as red lines. Each example shows our successful performance on the perception, prediction, and planning modules, indicated by $\square$, while failures are highlighted by $\square$. We also show the VLM text responses used at training time for each case.
  • Figure 5: OpenEMMA Testing Example (Bus Scene) on the nuScenes dataset (Caesar et al., 2020). In the front-view image, OpenEMMA’s projected future path is overlaid in light blue. An orange bus occupies the same lane, traveling in the same direction as the ego vehicle. Solid white lane markings run along the right side, with a white-striped curb on the left. OpenEMMA erroneously plans a leftward trajectory, which would result in a collision with that curb.
  • ...and 5 more figures
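
The alignment described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: a single linear map stands in for the trainable PFP layers, and $\mathcal{L}_f$ is assumed to be one minus cosine similarity, since the caption only says that $\mathcal{L}_f$ computes a similarity. All function names (`project`, `cosine_similarity`, `alignment_loss`) are hypothetical.

```python
import math


def project(feature, weight, bias):
    """Hypothetical stand-in for a PFP layer: a single linear map into the shared space."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weight, bias)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # small epsilon avoids division by zero


def alignment_loss(module_feature, text_feature, weight, bias):
    """Assumed form of L_f: 1 - cos(f_P, f_M), near 0 when the two features align."""
    f_p = project(module_feature, weight, bias)  # f_P: projected driving-module feature
    return 1.0 - cosine_similarity(f_p, text_feature)  # text_feature plays the role of f_M


# Toy usage with a 2-D identity projector.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

aligned = alignment_loss([1.0, 2.0], [1.0, 2.0], identity, zero_bias)
orthogonal = alignment_loss([1.0, 0.0], [0.0, 1.0], identity, zero_bias)
opposed = alignment_loss([1.0, 2.0], [-1.0, -2.0], identity, zero_bias)
print(aligned, orthogonal, opposed)
```

In training, a loss of this kind would be minimized with respect to the projector and the driving-module parameters while the encoded text features stay fixed, so that no VLM is needed once training ends.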