VLMs Guided Interpretable Decision Making for Autonomous Driving
Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding
TL;DR
The paper addresses the unreliability of direct decision-making by vision-language models (VLMs) in autonomous driving and proposes using VLMs as semantic enhancers that generate linguistically rich, spatially grounded scene descriptions. It introduces a dual-branch cross-attention architecture that fuses these enriched textual descriptions with visual cues, trained via multi-instance learning and complemented by a post-hoc VLM-based refinement module. In experiments on the BDD-OIA and PSI driving benchmarks, the approach achieves state-of-the-art performance while providing interpretable decision explanations through class activation maps (CAMs) and textual cues. The work demonstrates that semantic enrichment and structured multimodal fusion can yield more robust, explainable autonomous driving systems, highlighting a path toward reliable real-world deployment.
Abstract
Recent work in autonomous driving (AD) has explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
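To make the fusion idea concrete, the following is a minimal sketch of how a dual-branch cross-attention step and a multi-instance pooling step could look. It is not the authors' implementation: the summary above does not specify the layers, so the function names (`cross_attention`, `dual_branch_fuse`, `mil_bag_score`), the absence of learned projections, the mean pooling, and the use of max pooling for the multi-instance step are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def dual_branch_fuse(visual_tokens, text_tokens):
    """Hypothetical dual-branch fusion.

    Branch 1: visual tokens query the text tokens.
    Branch 2: text tokens query the visual tokens.
    The two pooled outputs are concatenated into one fused feature.
    """
    vis_enriched = cross_attention(visual_tokens, text_tokens, text_tokens)
    txt_enriched = cross_attention(text_tokens, visual_tokens, visual_tokens)
    return mean_pool(vis_enriched) + mean_pool(txt_enriched)


def mil_bag_score(instance_scores):
    """Max pooling over instance scores, one common multi-instance choice (assumed here)."""
    return max(instance_scores)


# Toy inputs: two visual region features and one text token, all 2-dimensional.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
fused = dual_branch_fuse(visual, text)
```

In this toy setup the fused vector has twice the token dimension, because each branch contributes one pooled vector. A real model would add learned projections, multiple heads, and a classifier on top of the fused feature.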
