Efficient Reinforcement Learning for Autonomous Driving with Parameterized Skills and Priors

Letian Wang; Jie Liu; Hao Shao; Wenshuo Wang; Ruobing Chen; Yu Liu; Steven L. Waslander

TL;DR

The paper tackles learning autonomous driving policies in dense, interactive traffic where learning directly over low-level controls is inefficient. It introduces ASAP-RL, which learns over parameterized ego-centric motion skills and leverages expert priors via inverse skill parameter recovery and a double initialization strategy for actor and critic pretraining. The approach yields higher learning efficiency and better driving performance than baselines that use skills or priors separately, demonstrated across highway, intersection, and roundabout scenarios with sparse rewards. By operating in the skill parameter space and exploiting priors, ASAP-RL achieves robust, diverse, and safe driving maneuvers with improved sample efficiency. The work provides practical avenues for deploying RL-based autonomous driving in complex real-world traffic and includes open-source code to facilitate further research.

Abstract

When autonomous vehicles are deployed on public roads, they will encounter countless and diverse driving situations. Many manually designed driving policies are difficult to scale to the real world. Fortunately, reinforcement learning (RL) has shown great success in many tasks through automatic trial and error. However, when it comes to autonomous driving in interactive dense traffic, RL agents either fail to reach reasonable performance or require a large amount of data. Our insight is that when humans learn to drive, they 1) make decisions over the high-level skill space instead of the low-level control space and 2) leverage expert prior knowledge rather than learning from scratch. Inspired by this, we propose ASAP-RL, an efficient reinforcement learning algorithm for autonomous driving that simultaneously leverages motion skills and expert priors. We first define parameterized motion skills that are diverse enough to cover various complex driving scenarios and situations. A skill parameter inverse recovery method is proposed to convert expert demonstrations from control space to skill space. A simple but effective double initialization technique is proposed to leverage expert priors while bypassing the issues of expert suboptimality and early performance degradation. We validate our proposed method on interactive dense-traffic driving tasks given simple and sparse rewards. Experimental results show that our method leads to higher learning efficiency and better driving performance relative to previous methods that exploit skills and priors differently. Code is open-sourced to facilitate further research.

Paper Structure (23 sections, 4 equations, 8 figures, 1 algorithm)

Introduction
Related Works
Reinforcement Learning with Skills
Reinforcement Learning with Expert Priors
Approach
Motion Skill Generation
Skill Parameter Recovery
Expert Prior Learning - Actor and Critic Pretraining
RL over parameterized skill with priors
Experiments
Experiment Setup
Environment
Reward Definition
Expert Demonstration Collection
Baselines
...and 8 more sections

Figures (8)

Figure 1: (a) RL-based AVs learning over the control space will exhibit inconsistent action sequences. In comparison, RL over the skill space can generate a sequence of consistent low-level actions with more informative exploration and accelerated reward signaling. (b) The parameterized motion skill provides a key interface for RL agents to explore and learn. (c) The expert demonstration can provide prior knowledge of which regions of the action space are more promising in getting rewards than others, which can accelerate learning.
Figure 2: The pipeline of the proposed ASAP-RL method. An inverse skill parameter recovery method is proposed to convert expert demonstration from control space to skill space. A double initialization method is introduced to initialize both actor and critic to inject the expert's prior knowledge into RL. The RL agent can learn and explore in the skill space instead of the control space while leveraging the expert priors, which leads to high learning efficiency and improved final performance.
Figure 3: An illustration of the parameterized motion skill generation process. One motion skill is determined by four skill parameters (shown in blue) that RL agents directly learn and explore. (a) The path is generated by connecting a start point and an endpoint (parameterized by the lateral position $y_e$ and heading angle $\phi_e$ of the endpoint) with a cubic polynomial. (b) The speed profile is represented by a cubic polynomial within the time window $T$, which is parameterized by the speed $v_0$ and acceleration $a_0$ at the beginning time and $v_e$ and $a_e$ at the end time. (c) The motion skill is generated by projecting the integral of the speed profile onto the path.
Figure 4: The bird's-eye view (BEV) images used as the observation and policy input. (a) the current scene; (b) road information (dashed line) and navigation lanes (in white); (c) historical waypoints of the ego vehicle; (d-f) surrounding objects (white rectangles) at time $t$, $t-1$, and $t-2$.
Figure 5: Comparison of our method with baselines on the highway, intersection, and roundabout scenarios. PPO and SAC are classical RL algorithms over control space. Constant SAC repeats the same action for the skill horizon $T$. SPiRL and TaEcRL learn in a low-dimensional latent skill space, and SPiRL also leverages expert priors. As detailed in the section on evaluation stages, the performance evaluation follows three stages that gradually distinguish the methods with increasingly specific metrics: 1) reward; 2) success rate and road completion ratio; 3) collision rate and cars passed per episode. Metrics in later stages need to be inspected only when the methods perform similarly in previous stages. Methods that are outperformed in previous evaluation stages are drawn as dashed lines in later stages. Our ASAP-RL outperforms all other methods, and the margin between ASAP-RL and other methods increases from stage 1 to stage 3.
...and 3 more figures
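
The skill construction described in the Figure 3 caption (a cubic path, a cubic speed profile, and a projection of travelled distance onto the path) can be sketched roughly as below. This is a minimal illustration and not the authors' released code. The function names, the fixed longitudinal endpoint distance `x_e`, the default horizon `T`, and the sampling-based arc-length projection are all assumptions made for the example.

```python
import bisect
import math


def hermite_cubic(p0, d0, p1, d1, span):
    """Coefficients (c0, c1, c2, c3) of the cubic that matches value and
    first derivative at 0 (p0, d0) and at `span` (p1, d1)."""
    delta = p1 - p0
    c2 = (3.0 * delta - (2.0 * d0 + d1) * span) / span ** 2
    c3 = (-2.0 * delta + (d0 + d1) * span) / span ** 3
    return (p0, d0, c2, c3)


def poly(c, t):
    """Evaluate the cubic at t."""
    return c[0] + c[1] * t + c[2] * t ** 2 + c[3] * t ** 3


def poly_integral(c, t):
    """Integral of the cubic from 0 to t (distance, if c is a speed profile)."""
    return c[0] * t + c[1] * t ** 2 / 2 + c[2] * t ** 3 / 3 + c[3] * t ** 4 / 4


def generate_skill(y_e, phi_e, v_e, a_e, v_0, a_0,
                   T=1.0, x_e=10.0, steps=10, samples=200):
    """Return `steps` ego-centric (x, y) waypoints for one motion skill.

    y_e, phi_e, v_e, a_e are the four skill parameters from the caption;
    v_0 and a_0 are the current speed and acceleration.
    x_e (longitudinal endpoint distance) is a hypothetical fixed value.
    """
    # (a) Path: cubic y(x) from the ego pose (origin, zero heading) to the endpoint.
    path = hermite_cubic(0.0, 0.0, y_e, math.tan(phi_e), x_e)
    # (b) Speed profile: cubic v(t) over the window [0, T].
    speed = hermite_cubic(v_0, a_0, v_e, a_e, T)

    # Sample the path and accumulate arc length for the projection step.
    pts = [(x_e * i / samples, poly(path, x_e * i / samples))
           for i in range(samples + 1)]
    arc = [0.0]
    for (xa, ya), (xb, yb) in zip(pts, pts[1:]):
        arc.append(arc[-1] + math.hypot(xb - xa, yb - ya))

    # (c) Project travelled distance s(t) onto the path (clamped to its ends).
    waypoints = []
    for k in range(1, steps + 1):
        s = min(max(poly_integral(speed, T * k / steps), 0.0), arc[-1])
        j = min(bisect.bisect_right(arc, s) - 1, samples - 1)
        w = (s - arc[j]) / (arc[j + 1] - arc[j])
        (xa, ya), (xb, yb) = pts[j], pts[j + 1]
        waypoints.append((xa + w * (xb - xa), ya + w * (yb - ya)))
    return waypoints
```

For example, `generate_skill(y_e=3.5, phi_e=0.0, v_e=10.0, a_e=0.0, v_0=10.0, a_0=0.0)` would describe a lane-change-like maneuver under these assumptions. In this sketch, an RL policy would output only the four skill parameters and a downstream controller would track the resulting waypoints.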
