VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving

Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, Sikai Chen

TL;DR

This work tackles the longstanding challenge of reward design in reinforcement learning for autonomous driving by introducing VLM-RL, which leverages pre-trained vision-language models to generate semantic rewards through a contrasting language goal paradigm. It combines positive and negative language goals with a hierarchical reward synthesis that also incorporates vehicle state signals, and adds a batch-processing scheme to maintain training efficiency. In extensive CARLA experiments, VLM-RL demonstrates better safety, route completion, and generalization than expert-designed and LLM-based baselines, and is shown to be compatible with multiple RL algorithms. The results suggest that integrating VLMs into end-to-end driving pipelines can yield more informative, robust, and scalable learning signals for safe autonomous navigation. Future directions include improving inference efficiency, expanding the set of driving tasks, and exploring human-in-the-loop learning or sim-to-real transfer to bridge simulation and real-world deployment.

Abstract

In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals using image observations and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5% reduction in collision rate, a 104.6% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL can seamlessly integrate with almost any standard RL algorithm, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements. The demo video and code can be accessed at: https://zilin-huang.github.io/VLM-RL-website.
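
As a rough illustration of the CLG-as-reward idea described in the abstract, the sketch below scores an observation embedding by its cosine similarity to a positive-goal embedding minus its cosine similarity to a negative-goal embedding. This is a minimal sketch under stated assumptions, not the paper's exact reward: the function names, the weights `alpha` and `beta`, and the toy vectors are hypothetical, and in practice the embeddings would come from the image and text encoders of a pre-trained VLM.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors (eps guards against zero norms)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def clg_reward(obs_emb, pos_emb, neg_emb, alpha=1.0, beta=1.0):
    """Hypothetical contrasting-language-goal reward.

    Rewards similarity to the positive goal and penalizes similarity to the
    negative goal. The weights alpha and beta are illustrative assumptions.
    """
    return (alpha * cosine_similarity(obs_emb, pos_emb)
            - beta * cosine_similarity(obs_emb, neg_emb))


# Toy embeddings standing in for VLM outputs (assumed, not real model output).
pos_goal = np.array([1.0, 0.0])    # stands in for a positive language goal
neg_goal = np.array([0.0, 1.0])    # stands in for a negative language goal
safe_obs = np.array([0.9, 0.1])    # observation closer to the positive goal
risky_obs = np.array([0.1, 0.9])   # observation closer to the negative goal

print(clg_reward(safe_obs, pos_goal, neg_goal))   # positive value
print(clg_reward(risky_obs, pos_goal, neg_goal))  # negative value
```

With these toy vectors, an observation aligned with the positive goal receives a positive reward and one aligned with the negative goal receives a negative reward, which is the contrast the paradigm relies on.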

Paper Structure

This paper contains 43 sections, 7 theorems, 31 equations, 16 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Assume the VLM embeddings accurately capture the semantic content of observations and language goals. Under this assumption, optimizing the policy $\pi$ to maximize the CLG reward $R_{\text{CLG}}$ defined in Eq. (7) encourages the agent to simultaneously increase similarity to the positive goal an

Figures (16)

  • Figure 1: Comparative Overview of Reward Design Paradigms for Autonomous Driving. (a) Fundamentals and limitations of IL/RL-based methods for driving policy learning. (b) Fundamentals and limitations of foundation model-based reward design methods (i.e., LLM-as-Reward and VLM-as-Reward paradigms) for driving policy learning. (c) Our proposed VLM-RL framework leverages VLMs to achieve a comprehensive and stable reward design for safe autonomous driving.
  • Figure 2: Architecture of the VLM-RL Framework for Autonomous Driving. (a) Observation and action spaces for policy learning; (b) Definition of CLG to provide semantic guidance; (c) CLG-based semantic reward computation using pre-trained VLMs; (d) Hierarchical reward synthesis that integrates semantic rewards with vehicle state information for comprehensive and stable reward signals; (e) Policy training with batch-processing, where SAC updates are performed using experiences stored in a replay buffer and rewards are computed asynchronously to optimize efficiency.
  • Figure 3: Conceptual comparisons of reward design paradigms. (a) Robotic manipulation tasks often feature well-defined goals (e.g., "Put carrot in bowl"), enabling VLMs to provide clear semantic rewards. (b) Existing methods that use only negative goals (e.g., "two cars have collided") focus on avoidance but lack positive guidance. (c) Our CLG-as-Reward paradigm integrates both positive and negative goals, allowing VLM-RL to deliver more informative semantic guidance for safer, more generalizable driving.
  • Figure 4: Bird's-eye view of the RL agent's surrounding environment, where the purple vehicle in (a) and the white box in (b) represent the RL agent.
  • Figure 5: Bird's-eye view of Towns and their drivable routes in CARLA.
  • ...and 11 more figures

Theorems & Definitions (21)

  • Definition 1: Contrasting Language Goal
  • Definition 2: VLM-as-Reward Paradigm
  • Definition 3: CLG-as-Reward Paradigm
  • Theorem 1: Effectiveness of CLG-as-Reward Paradigm
  • Proof 1
  • Definition 4: Synthesis Reward Function
  • Lemma 1: Lipschitz Continuity of Cosine Similarity
  • Proof 2
  • Theorem 2: Lipschitz Continuity of the CLG Reward Function
  • Proof 3
  • ...and 11 more