Knowledge Transfer from Simple to Complex: A Safe and Efficient Reinforcement Learning Framework for Autonomous Driving Decision-Making

Rongliang Zhou, Jiakun Huang, Mingjun Li, Hepeng Li, Haotian Cao, Xiaolin Song

TL;DR

Simulation experiments in highway lane-change scenarios show that the S2CD framework enhances learning efficiency, reduces training costs, and significantly improves safety compared to state-of-the-art algorithms.

Abstract

A safe and efficient decision-making system is crucial for autonomous vehicles. However, the complexity of driving environments limits the effectiveness of many rule-based and machine learning approaches. Reinforcement Learning (RL), with its robust self-learning capabilities and environmental adaptability, offers a promising solution to these challenges. Nevertheless, safety and efficiency concerns during training hinder its widespread application. To address these concerns, we propose a novel RL framework, Simple to Complex Collaborative Decision (S2CD). First, we rapidly train the teacher model in a lightweight simulation environment. Then, in the more complex and realistic environment, the teacher assesses the value of the student agent's actions and intervenes when the student exhibits suboptimal behavior, averting dangers. We also introduce an RL algorithm called Adaptive Clipping Proximal Policy Optimization Plus, which combines samples from both teacher and student policies and employs dynamic clipping strategies based on sample importance. This approach improves sample efficiency while effectively alleviating data imbalance. Additionally, we employ the Kullback-Leibler divergence as a policy constraint and convert the resulting constrained problem into an unconstrained one with the Lagrangian method to accelerate the student's learning. Finally, a gradual weaning strategy ensures that the student learns to explore independently over time, overcoming the teacher's limitations and maximizing performance. Simulation experiments in highway lane-change scenarios show that the S2CD framework enhances learning efficiency, reduces training costs, and significantly improves safety compared to state-of-the-art algorithms. The framework also ensures effective knowledge transfer between teacher and student models: even with suboptimal teachers, the student achieves superior performance, demonstrating the robustness and effectiveness of S2CD.
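The teacher intervention and gradual weaning described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual switch function or weaning schedule, so everything here is an assumption made for illustration: the names `should_intervene` and `wean_probability`, the value-gap `margin`, and the linear decay are hypothetical and are not taken from the paper.

```python
import random


def wean_probability(step, total_steps):
    """Probability that the teacher is still allowed to intervene.

    Assumed schedule: linear decay from 1.0 at step 0 to 0.0 at total_steps.
    """
    return max(0.0, 1.0 - step / total_steps)


def should_intervene(q_student, q_teacher, margin, wean_prob, rng):
    """Decide whether the teacher takes over for this step.

    q_student / q_teacher are value estimates of the student's and the
    teacher's proposed actions (how these are obtained is not specified
    in the abstract; here they are plain floats).
    """
    # Assumed rule: the student's action counts as suboptimal when its
    # value falls short of the teacher's by more than `margin`.
    suboptimal = (q_teacher - q_student) > margin
    # Weaning: even a suboptimal action is left alone with growing
    # probability as training progresses.
    return suboptimal and rng.random() < wean_prob


rng = random.Random(0)
for step in (0, 50_000, 150_000, 300_000):
    p = wean_probability(step, total_steps=300_000)
    takeover = should_intervene(0.2, 0.9, margin=0.5, wean_prob=p, rng=rng)
    print(step, round(p, 2), takeover)
```

Under this sketch the teacher always overrides clearly worse actions early in training and never does so once the schedule reaches zero, which is one simple way to realize "learning to explore independently over time".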

Paper Structure

This paper contains 35 sections, 4 theorems, 37 equations, 6 figures, 17 tables, 2 algorithms.

Key Result

Theorem 1

With the switch function, the return $J(\theta_\text{mix})$ is bounded both below and above by: where $H = \mathbb{E}_{s \sim d_{\pi^{\text{mix}}}} \mathcal{H}(\pi^{t}(\cdot | s))$ represents the average entropy of the teacher policy, $\kappa$ is a small error term, $R_{\max}$ represents the maximum reward value, and $\omega$ is the intervention rate determined by the switch function.
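To make the two quantities in the theorem statement concrete, the sketch below estimates the average teacher entropy $H$ and the intervention rate $\omega$ from logged rollouts. It assumes a discrete action space and that the logged states were visited under the mixed policy; the function names and the logging format are hypothetical, and the bound itself is not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_teacher_entropy(teacher_probs_per_state):
    """Monte Carlo estimate of H: mean entropy of the teacher policy
    over states sampled while running the mixed policy."""
    total = sum(entropy(p) for p in teacher_probs_per_state)
    return total / len(teacher_probs_per_state)


def intervention_rate(intervened_flags):
    """Empirical omega: fraction of steps on which the switch function
    handed control to the teacher."""
    return sum(1 for f in intervened_flags if f) / len(intervened_flags)


# Hypothetical log: teacher action probabilities at three visited states,
# and whether the teacher took over on each of four steps.
teacher_probs = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
flags = [True, False, False, True]
print(average_teacher_entropy(teacher_probs))
print(intervention_rate(flags))
```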

Figures (6)

  • Figure 1: The S2CD framework consists of a teacher training module, a high-level decision-making layer, and a lower-level layer, enhanced by 4 innovative modules to improve learning efficiency and performance: 1. Employing the teacher model for action intervention and demonstration to enhance the safety of the student policy; 2. Utilizing dual-source data for training to improve sample efficiency; 3. Employing KL divergence constraints in policy updates to enable the student policy to quickly approach the teacher policy; 4. Gradually reducing the teacher model's intervention via a weaning mechanism to prevent excessive reliance of the student on the teacher.
  • Figure 2: Medium-density traffic scenario with three lanes
  • Figure 3: The training curves for all algorithms are shown. Every algorithm is trained using three random seeds, accumulating 500K training steps in every case, except for S2CD, which needs only 300K steps. To evaluate the performance of the algorithms at each stage, two evaluation episodes are conducted every 5,000 training steps, and the average value of these episodes is recorded as the result.
  • Figure 4: The curves of model evaluation for all algorithms. Each model trained with varying random seeds is evaluated twice, and each algorithm undergoes six evaluation runs, with each evaluation consisting of 200 episodes.
  • Figure 5: The training and evaluation processes involve different teacher models: 1. Model Training: Every algorithm is trained with three random seeds, accumulating 300K training steps per case. Two evaluation episodes are conducted every 5,000 training steps, and their average value is recorded as the result. Throughout model training, we monitor the return value, safety cost, and collision counts. 2. Model Evaluation: Every model trained with different random seeds is evaluated twice. In total, every algorithm undergoes six evaluation runs, with each run consisting of 200 episodes. During model evaluation, the success rate is recorded.
  • ...and 1 more figure
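The dual-source training and KL-constrained update named in the Figure 1 caption can be sketched as a clipped surrogate with a per-sample clip range plus a Lagrangian KL penalty. This is a generic illustration, not the paper's Adaptive Clipping Proximal Policy Optimization Plus: the rule that sets each sample's `eps`, the direction of the KL term (teacher relative to student is assumed here), the target `kl_target`, and the dual-ascent multiplier update are all assumptions.

```python
import math


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample, with its own eps."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_categorical(p, q):
    """KL(p || q) for categorical distributions; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_loss(samples, lam, kl_target):
    """Loss to minimize: -surrogate + lam * (mean KL - kl_target).

    Each sample is a dict with keys ratio, advantage, eps,
    teacher_probs, student_probs (a hypothetical batch format).
    Returns (loss, mean_kl).
    """
    n = len(samples)
    surrogate = sum(
        clipped_surrogate(s["ratio"], s["advantage"], s["eps"]) for s in samples
    ) / n
    mean_kl = sum(
        kl_categorical(s["teacher_probs"], s["student_probs"]) for s in samples
    ) / n
    return -surrogate + lam * (mean_kl - kl_target), mean_kl


def update_multiplier(lam, mean_kl, kl_target, lr):
    """Dual ascent on the Lagrange multiplier, kept non-negative."""
    return max(0.0, lam + lr * (mean_kl - kl_target))


batch = [
    # A teacher-generated sample, given a wider clip range by assumption.
    {"ratio": 1.5, "advantage": 1.0, "eps": 0.3,
     "teacher_probs": [0.9, 0.1], "student_probs": [0.6, 0.4]},
    # A student-generated sample with a narrower clip range.
    {"ratio": 0.9, "advantage": -0.5, "eps": 0.1,
     "teacher_probs": [0.2, 0.8], "student_probs": [0.3, 0.7]},
]
loss, mean_kl = penalized_loss(batch, lam=1.0, kl_target=0.05)
lam = update_multiplier(1.0, mean_kl, kl_target=0.05, lr=0.5)
print(loss, mean_kl, lam)
```

When the measured KL exceeds the target, the multiplier grows and the penalty pulls the student toward the teacher; when it falls below, the multiplier shrinks toward zero and the clipped surrogate dominates.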

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem 3: Restatement of Theorem 3
  • proof
  • Theorem 4: Restatement of Theorem 4
  • proof