DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

Zilin Huang; Zihao Sheng; Zhengyang Wan; Yansong Qu; Junwei You; Sicong Jiang; Sikai Chen

Abstract

Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
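The asynchronous pipeline described in the abstract can be illustrated with a toy producer/consumer loop. This is a minimal sketch, not the authors' implementation: `slow_semantic_reward`, `reward_worker`, `collect`, and the dictionary-based replay buffer are hypothetical names, and the reward model is a trivial stand-in for VLM inference. The point it shows is that the environment loop stores transitions with an empty reward slot and never waits for the reward model, which fills the slot in later.

```python
import queue
import threading


def slow_semantic_reward(frame):
    """HYPOTHETICAL stand-in for expensive VLM/CLIP reward inference."""
    return -1.0 if frame["risky"] else 0.1


def reward_worker(pending, buffer, lock):
    """Consume queued transitions, compute rewards, write them back."""
    while True:
        item = pending.get()
        if item is None:  # sentinel: no more work
            pending.task_done()
            break
        idx, frame = item
        reward = slow_semantic_reward(frame)
        with lock:
            buffer[idx]["reward"] = reward
        pending.task_done()


def collect(num_steps):
    """Toy environment loop that does not block on reward computation."""
    buffer, lock, pending = [], threading.Lock(), queue.Queue()
    worker = threading.Thread(
        target=reward_worker, args=(pending, buffer, lock), daemon=True
    )
    worker.start()
    for t in range(num_steps):
        frame = {"step": t, "risky": t % 4 == 0}  # toy observation
        with lock:
            # Store the transition now; the reward is filled in later.
            buffer.append({"frame": frame, "reward": None})
        pending.put((t, frame))
    pending.put(None)  # tell the worker to stop
    pending.join()     # wait until every queued item is labelled
    worker.join()
    return buffer


labelled = collect(8)
```

In a real training setup the learner would sample only transitions whose reward slot has already been filled; that scheduling detail is omitted here.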

Paper Structure (57 sections, 13 theorems, 38 equations, 15 figures, 9 tables, 1 algorithm)

Introduction
Preliminaries
Markov Decision Process Formulation
VLM-as-Reward Paradigm
Problem Statement
Framework: DriveVLM-RL
Overview
Static Pathway
Static Reward Computation
Theoretical Properties
Dynamic Pathway
Attentional Gate
Multi-Frame Semantic Reasoning
Dynamic Reward Computation
Theoretical Properties
...and 42 more sections

Key Result

Lemma 1

For any observation $o_t$ and CLG pair $(l_{\text{pos}}, l_{\text{neg}})$, the static reward is bounded: $R_{\text{static}}(o_t) \in [-1, 1]$.
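The definition of the static reward is not reproduced on this page, so the following is only a sketch under an assumed form: a weighted difference of cosine similarities, $\alpha \cdot \text{sim}(o_t, l_{\text{pos}}) - (1 - \alpha) \cdot \text{sim}(o_t, l_{\text{neg}})$ with $\alpha \in [0, 1]$. Under that assumption the bound follows because each cosine similarity lies in $[-1, 1]$ and the two weights sum to 1. The function names, the weight `alpha`, and the random vectors standing in for CLIP embeddings are all hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors; lies in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def static_reward(obs_emb, pos_emb, neg_emb, alpha=0.5):
    """ASSUMED form: alpha * sim(o, l_pos) - (1 - alpha) * sim(o, l_neg)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pos = cosine_similarity(obs_emb, pos_emb)
    neg = cosine_similarity(obs_emb, neg_emb)
    return alpha * pos - (1.0 - alpha) * neg


# Random vectors stand in for image and text embeddings.
rng = np.random.default_rng(0)
rewards = [
    static_reward(
        rng.normal(size=512),
        rng.normal(size=512),
        rng.normal(size=512),
        alpha=rng.uniform(),
    )
    for _ in range(1000)
]

# Extreme case: observation matches the positive goal and opposes the negative one.
v = rng.normal(size=512)
best_case = static_reward(v, v, -v, alpha=0.3)
```

The extreme case attains the upper end of the interval up to floating-point rounding, which is why a small tolerance is appropriate when checking the bound numerically.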

Figures (15)

Figure 1: Comparative learning paradigms for autonomous driving. (a) Traditional policy learning approaches, including IL and RL, which rely on expert demonstrations or hand-crafted rewards. (b) Foundation model–based approaches, including VLM-as-Control and VLM-as-Reward paradigms. (c) The proposed DriveVLM-RL framework, which integrates a dual-pathway architecture to enable dynamic, context-aware semantic rewards while remaining real-time deployable.
Figure 2: Neuroscience-inspired motivation of DriveVLM-RL. The framework is inspired by the brain’s habitual and deliberative visual processing: routine scenes are handled by a fast pathway, while safety-critical situations trigger attention and higher-level semantic reasoning, motivating a dual-pathway reward learning design.
Figure 3: Overview of DriveVLM-RL. (a) Static Pathway: CLIP-based semantic alignment with contrasting language goals to provide continuous spatial safety assessment. (b) Dynamic Pathway: an attention-gated mechanism triggers multi-frame LVLM reasoning only in safety-critical situations. (c) Hierarchical reward synthesis: static and dynamic semantic signals are fused and integrated with vehicle-state factors to produce the final shaping reward. (d) Asynchronous training pipeline: reward computation is decoupled from environment interaction and policy learning.
Figure 4: Attention-gated dynamic reward generation in DriveVLM-RL. Routine frames bypass semantic reasoning, while safety-critical frames trigger multi-frame LVLM inference to produce a risk description, which is converted into a dynamic reward via CLIP-based semantic similarity.
Figure 5: Multi-modal observations of the ego vehicle in urban traffic, comprising BEV representation, semantic segmentation, and camera views with diverse traffic participants (signals, motorcyclists, cyclists, and pedestrians).
...and 10 more figures
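The attention-gated flow in the Figure 4 caption can be sketched as follows. This is an illustrative toy, not the paper's code: the trigger classes, the window length, the neutral reward of 0.0 for routine frames, and the callables `detect`, `describe_risk`, and `similarity` are all assumptions standing in for the lightweight detector, the LVLM, and the CLIP-based similarity.

```python
from collections import deque

CRITICAL_CLASSES = {"pedestrian", "cyclist", "motorcyclist"}  # ASSUMED trigger set
WINDOW = 4  # ASSUMED number of recent frames passed to the LVLM


def gate_is_open(detected_classes):
    """Open the gate when any detected class is safety-critical."""
    return any(c in CRITICAL_CLASSES for c in detected_classes)


def run_episode(frames, detect, describe_risk, similarity):
    """Return per-frame dynamic rewards and the number of LVLM calls made."""
    window = deque(maxlen=WINDOW)
    rewards, lvlm_calls = [], 0
    for frame in frames:
        window.append(frame)
        if gate_is_open(detect(frame)):
            # Safety-critical frame: multi-frame reasoning, then similarity.
            description = describe_risk(list(window))
            lvlm_calls += 1
            rewards.append(similarity(frame, description))
        else:
            # Routine frame: bypass semantic reasoning (ASSUMED neutral reward).
            rewards.append(0.0)
    return rewards, lvlm_calls


# Toy stand-ins: every fifth frame contains a pedestrian.
frames = [
    {"id": i, "objects": ["car", "pedestrian"] if i % 5 == 0 else ["car"]}
    for i in range(10)
]
rewards, calls = run_episode(
    frames,
    detect=lambda f: f["objects"],
    describe_risk=lambda w: "pedestrian ahead",
    similarity=lambda f, d: -0.8,
)
```

With this toy input the expensive call runs on only two of the ten frames, which is the kind of saving the gate is meant to provide.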

Theorems & Definitions (28)

Definition 1: Static Contrasting Language Goal
Definition 2: Static Reward
Lemma 1: Boundedness
Lemma 2: Discriminability
Theorem 1: Reward-Induced State Ordering
Definition 3: Attentional Gate
Definition 4: Dynamic Language Goal
Definition 5: Dynamic Reward
Lemma 3: Computational Efficiency
Theorem 2: Information Preservation under Gating
...and 18 more
