VLMs Guided Interpretable Decision Making for Autonomous Driving

Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding

TL;DR

The paper addresses the unreliability of direct decision-making by vision-language models in autonomous driving and proposes using VLMs as semantic enhancers that generate linguistically rich, spatially grounded scene descriptions. It introduces a dual-branch cross-attention architecture that fuses enriched textual descriptions with visual cues, trained via multi-instance learning, and complemented by a post-hoc VLM-based refinement module. Through experiments on the BDD-OIA and PSI driving benchmarks, the approach achieves state-of-the-art performance while providing interpretable decision explanations through CAMs and textual cues. The work demonstrates that semantic enrichment and structured multimodal fusion can yield more robust, explainable autonomous driving systems, highlighting a path toward reliable real-world deployment.

Abstract

Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
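The fusion idea summarized above (two cross-attention branches over visual and textual features, with a multi-instance style pooling before the decision) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`cross_attention`, `fuse_and_score`), the toy 2-dimensional features, the max-pooling choice, and the four hypothetical action weight vectors are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values,
    which serve as both keys and values (no learned projections here)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        out = [sum(wj * kv[i] for wj, kv in zip(w, keys_values)) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def fuse_and_score(visual, text, action_weights):
    # Branch 1 (assumed): visual region features query the text tokens.
    v_attended, v_to_t = cross_attention(visual, text)
    # Branch 2 (assumed): text tokens query the visual regions.
    t_attended, _ = cross_attention(text, visual)
    # MIL-style pooling (assumed): every attended feature is one instance;
    # the bag-level logit for an action is the max instance score.
    instances = v_attended + t_attended
    probs = []
    for w in action_weights:
        logit = max(dot(w, inst) for inst in instances)
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # independent sigmoid per action
    return probs, v_to_t

# Toy inputs: 3 visual regions, 2 text tokens, 4 hypothetical actions.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[1.0, 0.0], [0.0, 1.0]]
action_weights = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
probs, attn = fuse_and_score(visual, text, action_weights)
print(probs)
print(attn)
```

The attention weights returned by the first branch indicate which text tokens each visual region relied on, which is the kind of signal an interpretable system could surface. A real model would add learned query/key/value projections and would be trained end to end.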

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) The conventional VLM method for decision-making, which ignores both the intra-modality relationships among local objects and the inter-modality alignment between local objects and the text description. (b) The proposed framework, which explores both intra- and inter-modality relationships.
  • Figure 2: A failure case of GPT-4V on driver decision-making, showing that GPT-4V tends to generate vague and incorrect answers. It lists all possible action choices, which would confuse a downstream system. In this case, GPT-4V mistakenly identifies the driving car as being in "the rightmost lane", which leads to the wrong decision.
  • Figure 3: Overview of GPT-4V preprocessing with three questions in Chain-of-Thought design.
  • Figure 4: Examples of enriched descriptions from GPT-4V on the BDD-OIA dataset. The first column shows that the original BDD-OIA explanations contain a wrong explanation ("Traffic light is green") and omit details. The supplementary descriptions from GPT-4V in the second column give the reasons for choosing "stop/slow down" and for rejecting "left turn".
  • Figure 5: (a) Sample size per super category. (b) Distribution of the number of annotations per sample on the enriched BDD-OIA.
  • ...and 3 more figures