Human-aligned Safe Reinforcement Learning for Highway On-Ramp Merging in Dense Traffic

Yang Li, Shijie Yuan, Yuan Chang, Xiaolong Chen, Qisong Yang, Zhiyuan Yang, Hongmao Qin

TL;DR

This work addresses the problem of merging onto a highway from an on-ramp safely without giving up efficiency, and introduces a human-aligned safe reinforcement learning framework. High-level decisions are cast as a CMDP with risk-preference constraints and solved with a Lagrangian-based SAC algorithm, while MPC handles low-level control; an action shielding mechanism (ASM) pre-executes each RL action with MPC and checks it for collisions using motion predictions of surrounding vehicles. A fuzzy controller maps the user's risk preference and the traffic density to the cost limit of the safety constraint, and a theoretical analysis supports the shielding mechanism's benefits for safety and sample efficiency. Experiments across several traffic densities show fewer safety violations without loss of traffic efficiency, and ablations indicate that n-step TD, the ASM, and the CMDP constraints each contribute to training stability and performance. The approach offers a practical pathway toward online learning in real-world autonomous driving by aligning policies with human risk preferences and filtering out unsafe actions through shielding.
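As a rough illustration of the Lagrangian treatment of a CMDP, the sketch below shows the penalized objective and a dual-ascent update of the multiplier. The function names, the learning rate, and the sample cost values are assumptions made for illustration; only the cost limit 0.0595 is taken from the Figure 5 caption, and nothing here is the authors' implementation of Lagrangian SAC.

```python
def penalized_return(reward: float, cost: float, lam: float) -> float:
    """Lagrangian objective seen by the policy: reward minus lambda-weighted cost."""
    return reward - lam * cost


def update_multiplier(lam: float, avg_cost: float, cost_limit: float,
                      lr: float = 0.1) -> float:
    """One dual-ascent step on the multiplier.

    lambda grows while the average cost exceeds the limit and shrinks
    otherwise; it is clipped at zero so the penalty never becomes a bonus.
    """
    return max(0.0, lam + lr * (avg_cost - cost_limit))


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-iteration average costs; 0.0595 is the example cost
    # limit quoted in the Figure 5 caption.
    for avg_cost in [0.12, 0.10, 0.07, 0.04]:
        lam = update_multiplier(lam, avg_cost, cost_limit=0.0595)
        print(f"avg_cost={avg_cost:.2f} -> lambda={lam:.5f}")
```

In this form a tighter cost limit (a more conservative user) keeps the multiplier larger for longer, which pushes the policy toward lower-cost behavior.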

Abstract

Most reinforcement learning (RL) approaches to decision-making in autonomous driving treat safety as a reward term rather than a cost, which makes it hard to balance safety against other objectives. Human risk preference has also rarely been incorporated, so the trained policy may be too conservative or too aggressive for a given user. To this end, this study proposes a human-aligned safe RL approach for autonomous merging, in which the high-level decision problem is formulated as a constrained Markov decision process (CMDP) that incorporates users' risk preferences into the safety constraints, followed by a model predictive control (MPC)-based low-level controller. The safety level of the RL policy can be adjusted by computing the cost limits of the CMDP constraints from the risk preference and the traffic density using a fuzzy control method. To filter out unsafe or invalid actions, we design an action shielding mechanism that pre-executes RL actions using an MPC method and performs collision checks with surrounding agents. We also provide a theoretical proof of the effectiveness of the shielding mechanism in enhancing the safety and sample efficiency of RL. Simulation experiments at multiple levels of traffic density show that our method can significantly reduce safety violations without sacrificing traffic efficiency. Furthermore, owing to the risk preference-aware constraints in the CMDP and the action shielding, we can not only adjust the safety level of the final policy but also reduce safety violations during training, which makes the method a promising solution for online learning in real-world environments.
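The shielding idea described in the abstract can be sketched as follows: each candidate discrete action is pre-executed to obtain a predicted ego path, the path is checked against the predicted paths of surrounding vehicles, and only collision-free actions remain eligible. This is a minimal sketch under stated assumptions: `toy_rollout` is a constant-velocity stand-in for the MPC pre-execution, and the action names, the `min_gap` distance, and the `fallback` behavior when no action is safe are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def collides(ego_path: Sequence[Point], other_path: Sequence[Point],
             min_gap: float) -> bool:
    """True if the two predicted paths come closer than min_gap at any common step."""
    return any(
        (ex - ox) ** 2 + (ey - oy) ** 2 < min_gap ** 2
        for (ex, ey), (ox, oy) in zip(ego_path, other_path)
    )


def shield(scores: Dict[str, float],
           rollout: Callable[[str], List[Point]],
           others: List[List[Point]],
           min_gap: float = 2.0,
           fallback: str = "idle") -> str:
    """Return the best-scoring action whose pre-executed path is collision-free."""
    safe = []
    for action in scores:
        ego_path = rollout(action)  # pre-execute the action once
        if not any(collides(ego_path, path, min_gap) for path in others):
            safe.append(action)
    if not safe:
        # Assumption: fall back to a default action when every action is masked.
        return fallback
    return max(safe, key=scores.get)


def toy_rollout(action: str) -> List[Point]:
    """Constant-velocity stand-in for the MPC pre-execution (hypothetical)."""
    lane_y = {"idle": 0.0, "left": 4.0}[action]
    return [(10.0 * t, lane_y) for t in range(4)]


if __name__ == "__main__":
    # One surrounding vehicle occupying the target lane alongside the ego vehicle.
    neighbor = [(10.0 * t, 4.0) for t in range(4)]
    policy_scores = {"left": 0.9, "idle": 0.4}
    print(shield(policy_scores, toy_rollout, [neighbor]))
```

Here the lane change scores highest but is masked because the target lane is occupied, so the shield selects the remaining safe action.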

Paper Structure

This paper contains 55 sections, 55 equations, 16 figures, 6 tables, 3 algorithms.

Figures (16)

  • Figure 1: Illustration of the highway on-ramp merging scenario. The ego vehicle (green) starts on a one-lane entrance ramp and needs to merge into the highway traffic safely and efficiently.
  • Figure 2: Overview of the proposed method. The high-level decision problem is formulated as a CMDP that incorporates individuals' risk preferences into the constraints, followed by an MPC-based low-level control. A Lagrangian-based SAC algorithm is used to solve the CMDP for the optimal RL policy. The RL action is a discrete action such as left, right, acceleration, etc. To filter out unsafe or invalid RL actions, we design an action shielding mechanism that masks out risky ones by pre-executing the action with MPC and conducting collision constraint checks. Then, the safe RL action is sent to the low-level MPC, which generates the vehicle control commands (acceleration and steering angle) for the simulation environment. The simulation environment finally generates the state, reward, and cost data for training the high-level RL agent. The RL agent learns to act by trial and error, and the safety violations during the training process can be reduced using our method.
  • Figure 3: Cost design based on motion predictions of the ego vehicle and surrounding objects. There are three typical situations: (a) failing to merge before reaching the end of the road, (b) colliding with other vehicles, and (c) having difficulty merging when the target lane is occupied by other vehicles.
  • Figure 4: The membership functions of the fuzzy inputs risk level and traffic density, and the fuzzy output cost limit. (a) risk level, varying between 0 and 100%. (b) traffic density, varying between 0.5 and 1. (c) cost limit, varying between 0 and 0.1.
  • Figure 5: Illustration of the aggregation and defuzzification steps of the Mamdani inference process. The fuzzy sets of the cost limit are large, medium, and small, drawn as black, red, and blue dashed lines. For a traffic density of 0.57 and a risk level of 45%, the inputs activate the fuzzy sets $\tilde{A} = \{\text{Low}, \text{Medium}\}$ and $\tilde{B} = \{\text{Conservative}, \text{Neutral}\}$, and the resulting fuzzy outputs for the small, medium, and large cost limit are 0.25, 0.35, and 0.65, respectively. Each fuzzy output defines a grey area under its fuzzy set, and the three grey areas are aggregated into a single fuzzy set by taking their union. The centroid of this aggregated area gives the crisp cost limit, which is 0.0595.
  • ...and 11 more figures
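The Mamdani inference summarized in the captions of Figures 4 and 5 (fuzzify the inputs, fire the rules, clip and aggregate the output sets, take the centroid) can be sketched as below. Only the input and output ranges come from the Figure 4 caption; the triangular membership shapes, the set breakpoints, and the rule table are assumptions made for illustration, so the sketch is not expected to reproduce the value 0.0595 quoted for Figure 5.

```python
from typing import Dict, Tuple

Tri = Tuple[float, float, float]  # (left foot, peak, right foot)

# Assumed triangular sets spanning the ranges given in the Figure 4 caption.
DENSITY: Dict[str, Tri] = {
    "low": (0.5, 0.5, 0.75), "medium": (0.5, 0.75, 1.0), "high": (0.75, 1.0, 1.0)}
RISK: Dict[str, Tri] = {
    "conservative": (0.0, 0.0, 50.0), "neutral": (0.0, 50.0, 100.0),
    "aggressive": (50.0, 100.0, 100.0)}
COST: Dict[str, Tri] = {
    "small": (0.0, 0.0, 0.05), "medium": (0.0, 0.05, 0.1), "large": (0.05, 0.1, 0.1)}

# Hypothetical rule table: (density set, risk set) -> cost-limit set.
RULES: Dict[Tuple[str, str], str] = {
    ("low", "conservative"): "small", ("low", "neutral"): "medium",
    ("low", "aggressive"): "large", ("medium", "conservative"): "small",
    ("medium", "neutral"): "medium", ("medium", "aggressive"): "large",
    ("high", "conservative"): "small", ("high", "neutral"): "small",
    ("high", "aggressive"): "medium"}


def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet a, c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def cost_limit(density: float, risk_pct: float, steps: int = 1000) -> float:
    """Mamdani inference with min/max operators and centroid defuzzification."""
    mu_d = {k: tri(density, *v) for k, v in DENSITY.items()}
    mu_r = {k: tri(risk_pct, *v) for k, v in RISK.items()}
    # Rule firing: min for AND, max over rules that share a consequent.
    act = {k: 0.0 for k in COST}
    for (d, r), out in RULES.items():
        act[out] = max(act[out], min(mu_d[d], mu_r[r]))
    # Clip each output set at its activation, take the union, then the centroid.
    num = den = 0.0
    for i in range(steps + 1):
        z = 0.1 * i / steps
        mu = max(min(act[k], tri(z, *COST[k])) for k in COST)
        num += z * mu
        den += mu
    if den == 0.0:
        raise ValueError("no rule fired; inputs lie outside the assumed ranges")
    return num / den


if __name__ == "__main__":
    print(cost_limit(density=0.57, risk_pct=45.0))
```

With this assumed rule table, a higher risk level at the same traffic density yields a larger cost limit, which is the qualitative behavior one would expect from a risk-preference-aware constraint.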