CLIP-RLDrive: Human-Aligned Autonomous Driving via CLIP-Based Reward Shaping in Reinforcement Learning

Erfan Doroudian, Hamid Taghavifar

TL;DR

This work tackles autonomous driving decision-making at urban unsignalized intersections where traditional reward design is challenging. It proposes CLIP-RLDrive, a framework that uses a CLIP-based reward to shape RL policies alongside conventional environment rewards in DQN and PPO. The authors collect a 500-image, 500-instruction dataset and apply transfer learning by updating only the upper layer of the ViT-B/32 encoder, implementing the dual reward $R_{Final} = R_{Basic} + W_c \times R_{CLIP}$ with $W_c = 1.2$. Experiments in highway-env demonstrate that CLIP-guided agents, particularly CLIP-DQN, achieve higher success and safer behavior than baselines, indicating that vision-language guidance can produce human-aligned driving.
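The dual reward above is simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' implementation: the function name `final_reward` and its argument names are hypothetical, and only the formula and the value $W_c = 1.2$ come from the summary.

```python
# Minimal sketch of the dual reward R_Final = R_Basic + W_c * R_CLIP.
# Names are hypothetical; only the formula and W_c = 1.2 come from the summary.

W_C = 1.2  # weight on the CLIP-based reward, as reported in the summary


def final_reward(r_basic: float, r_clip: float, w_c: float = W_C) -> float:
    """Combine the environment reward with the weighted CLIP reward."""
    return r_basic + w_c * r_clip


if __name__ == "__main__":
    # Example step: environment reward 1.0, CLIP reward 0.5.
    print(final_reward(1.0, 0.5))
```

In an RL loop, a value like this would replace the raw environment reward passed to the DQN or PPO update at each step; how often the CLIP reward is evaluated is not stated in this summary.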

Abstract

This paper presents CLIP-RLDrive, a new reinforcement learning (RL)-based framework for improving the decision-making of autonomous vehicles (AVs) in complex urban driving scenarios, particularly at unsignalized intersections. To achieve this goal, AV decisions are aligned with human-like preferences through Contrastive Language-Image Pretraining (CLIP)-based reward shaping. One of the primary difficulties in an RL scheme is designing a suitable reward model, which is often hard to do manually because of the complexity of the interactions and driving scenarios. To address this issue, the paper leverages Vision-Language Models (VLMs), particularly CLIP, to build an additional reward model based on visual and textual cues.
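As a rough illustration of how visual and textual cues could yield a scalar signal, the sketch below scores an image embedding against an instruction embedding with cosine similarity, which is the quantity CLIP itself compares. This is an assumption made for illustration: the abstract does not say how the paper maps CLIP outputs to a reward, and the names `cosine_similarity` and `clip_reward` are hypothetical. The embeddings are plain lists standing in for encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_reward(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Hypothetical reward: similarity between scene and instruction embeddings.

    Assumption for illustration only; the paper's actual mapping from
    CLIP outputs to a reward is not given in this summary.
    """
    return cosine_similarity(image_emb, text_emb)


if __name__ == "__main__":
    # Toy vectors in place of real CLIP image and text encoder outputs.
    scene = [0.2, 0.9, 0.1]
    instruction = [0.1, 0.8, 0.3]
    print(clip_reward(scene, instruction))
```

A higher score means the observed scene matches the instruction more closely, so an agent rewarded this way is nudged toward states that resemble the described behavior.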

Paper Structure

This paper contains 17 sections, 15 equations, 14 figures, 4 tables, and 2 algorithms.

Figures (14)

  • Figure 1: A general block diagram of the proposed framework based on PPO.
  • Figure 2: DQN: Convolutional Neural Network with the last 4 frames.
  • Figure 3: A summary of the (scenario, instruction) pairs in various driving situations at an unsignalized intersection, used for calibrating the CLIP model.
  • Figure 4: Architecture for CLIP as a reward model.
  • Figure 5: The change of average reward per episode through the training process for CLIP-based and non-CLIP-based DQN, and PPO.
  • ...and 9 more figures