DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving

Dawood Wasif, Terrence J. Moore, Chandan K. Reddy, Frederica Free-Nelson, Seunghyun Yoon, Hyuk Lim, Dan Dongseong Kim, Jin-Hee Cho

Abstract

End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 ± 2.3 km/h average speed, 0.98 ± 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.

Paper Structure

This paper contains 71 sections, 50 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Dual‑VLM architecture and reward pipeline of DriveMind: A static contrastive VLM encoder maps each bird’s-eye-view image $\psi(s_t)$ to a fixed semantic embedding $v_t$, while a novelty detector asynchronously triggers a dynamic VLM encoder-decoder to update “present” and “ideal” prompt embeddings $(u^P_t,u^I_t)$ via distilled GPT‑4 chain-of-thought. These embeddings feed into the Reward Combiner, yielding (i) Adaptive Ideal-State Contrastive Reward, (ii) Hierarchical Vehicle-State Fusion Reward, and (iii) Predictive Contrastive Foresight Reward.
  • Figure 2: Sample ground-truth labels from the GPT-4 teacher for chain-of-thought distillation of the dynamic VLM: Examples from (i) a quiet residential street and (ii) a tight urban intersection. Each shows a bird’s-eye-view scene and the generated outputs: Scene Overview, Risk Assessment, Guidance Summary, and the present and ideal prompts used as distillation targets for producing the Adaptive Ideal-State Contrastive Reward (AICR).
  • Figure 3: System, Training, and Inference Prompts used for Chain-of-Thought Distillation and Dynamic VLM Fine-Tuning.
  • Figure 4: Three driving scenarios illustrating how chain-of-thought prompting structures semantic reward generation. Each box shows the CoT breakdown (Scene Overview, Risk Assessment, and Guidance Summary), followed by the negative “Present Prompt” and positive “Ideal Prompt” used to compute the Adaptive Ideal-State Contrastive Reward.
  • Figure 5: DriveMind Training Dynamics over 1M Timesteps. Top row (red): Collision statistics: the cumulative number of collisions (collision_num), instantaneous collision rate (collision_rate), and the average interval between collisions (collision_interval). Middle row (orange): Driving performance metrics: number of completed routes per episode (routes_completed), average speed (avg_speed), and mean lateral deviation from lane center (avg_center_dev). Bottom row (pink): Learning signals: the total episodic reward (total_reward), mean per-step reward (mean_reward), and the SAC critic’s Bellman-error loss (critic_loss).
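
The contrastive reward idea in the Figure 1 and Figure 4 captions can be illustrated with a minimal sketch. It is not the paper's implementation: the exact reward equations are not reproduced on this page, so the form used here (cosine similarity to the "ideal" prompt embedding minus cosine similarity to the "present" prompt embedding) is an assumption. The function names, the novelty test, and the threshold value are likewise hypothetical. The variables mirror the caption's symbols: v_t is the image embedding, and u_ideal and u_present stand for $u^I_t$ and $u^P_t$.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def contrastive_reward(
    v_t: Sequence[float],
    u_ideal: Sequence[float],
    u_present: Sequence[float],
) -> float:
    """ASSUMED form: reward closeness to the positive 'ideal' prompt and
    penalize closeness to the negative 'present' prompt."""
    return cosine(v_t, u_ideal) - cosine(v_t, u_present)


def is_novel(
    v_t: Sequence[float],
    anchor: Sequence[float],
    threshold: float = 0.2,  # hypothetical value, not taken from the paper
) -> bool:
    """ASSUMED novelty test: flag semantic drift when the cosine distance
    from a stored anchor embedding exceeds the threshold."""
    return (1.0 - cosine(v_t, anchor)) > threshold
```

In this sketch, a step whose embedding matches the ideal prompt and is orthogonal to the present prompt receives the maximum reward under the assumed form, and `is_novel` marks the point at which a dynamic VLM would be asked for fresh prompts.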