Dynamic Residual Safe Reinforcement Learning for Multi-Agent Safety-Critical Scenarios Decision-Making

Kaifeng Wang, Yinsong Chen, Qi Liu, Xueyuan Li, Xin Gao

TL;DR

The paper tackles safety-critical decision-making in multi-agent autonomous driving by introducing Dynamic Residual Safe Reinforcement Learning (DRS-RL) built on a Safety-Enhanced Networked MDP and a Multi-Agent Dynamic Conflict Zone (MADCZ) model. The core idea is a dual-policy framework where a lightweight safety policy provides residual corrections to a task policy via a dynamic weighting factor $\alpha_t$, fused as $A = A^{task} + \alpha_t (A^{safe} - A^{task})$, enabling real-time risk management without sacrificing performance. A Risk-Aware Prioritized Experience Replay maps real-time risk to sampling probability to counter data distribution biases against safety-critical episodes. Experimental results on a Bench2Drive-derived MASCS set show substantial safety gains (e.g., up to $92.17\%$ collision reduction) and competitive efficiency and comfort, validating the effectiveness and parameter efficiency of the proposed approach for multi-agent safety-critical decision-making.
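The fusion rule $A = A^{task} + \alpha_t (A^{safe} - A^{task})$ can be illustrated with a minimal sketch. The function name `fuse_actions`, the two-dimensional action layout, and the clipping of $\alpha_t$ to $[0, 1]$ are assumptions made for illustration; the summary above does not specify how $\alpha_t$ is computed or bounded.

```python
import numpy as np


def fuse_actions(a_task, a_safe, alpha_t):
    """Residual fusion: A = A_task + alpha_t * (A_safe - A_task).

    alpha_t = 0 returns the task action unchanged;
    alpha_t = 1 returns the safety action.
    """
    a_task = np.asarray(a_task, dtype=float)
    a_safe = np.asarray(a_safe, dtype=float)
    # Assumption: alpha_t is treated as a weight in [0, 1].
    alpha_t = float(np.clip(alpha_t, 0.0, 1.0))
    return a_task + alpha_t * (a_safe - a_task)


# Hypothetical actions laid out as [acceleration, steering].
a_task = [0.8, 0.1]    # task policy wants to accelerate
a_safe = [-1.0, 0.0]   # safety policy wants to brake

fused_low = fuse_actions(a_task, a_safe, 0.0)   # low risk: task action
fused_mid = fuse_actions(a_task, a_safe, 0.5)   # moderate risk: midpoint
fused_high = fuse_actions(a_task, a_safe, 1.0)  # high risk: safety action
```

Written this way, the safety policy only contributes a correction term scaled by $\alpha_t$, so the task policy's output is left untouched whenever the weight is zero.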

Abstract

In multi-agent safety-critical scenarios, traditional autonomous driving frameworks face significant challenges in balancing safety constraints and task performance. These frameworks struggle to quantify dynamic interaction risks in real time and depend heavily on manual rules, resulting in low computational efficiency and conservative strategies. To address these limitations, we propose a Dynamic Residual Safe Reinforcement Learning (DRS-RL) framework grounded in a safety-enhanced networked Markov decision process. This work is the first to introduce weak-to-strong theory into multi-agent decision-making, enabling lightweight dynamic calibration of safety boundaries via a weak-to-strong safety correction paradigm. Based on the multi-agent dynamic conflict zone model, our framework accurately captures spatiotemporal coupling risks among heterogeneous traffic participants and surpasses the static constraints of conventional geometric rules. Moreover, a risk-aware prioritized experience replay mechanism mitigates data distribution bias by mapping risk to sampling probability. Experimental results show that the proposed method significantly outperforms traditional RL algorithms in safety, efficiency, and comfort. Specifically, it reduces the collision rate by up to 92.17%, while the safety model has only 27% as many parameters as the main model.
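The idea of mapping risk to sampling probability can be sketched as follows. The abstract does not give the actual mapping, so this example assumes the proportional form used in standard prioritized experience replay, with per-transition risk standing in for the usual TD-error priority. The names `risk_to_probs` and `sample_indices` and the parameters `exponent` and `eps` are hypothetical.

```python
import numpy as np


def risk_to_probs(risks, exponent=0.6, eps=1e-3):
    """Map per-transition risk scores to sampling probabilities.

    Assumption: p_i is proportional to (risk_i + eps) ** exponent.
    This is the proportional-prioritization form from standard PER,
    not necessarily the mapping used in the paper.
    """
    risks = np.asarray(risks, dtype=float)
    # eps keeps zero-risk transitions sampleable.
    priorities = (np.maximum(risks, 0.0) + eps) ** exponent
    return priorities / priorities.sum()


def sample_indices(risks, batch_size, rng):
    """Draw replay-buffer indices, favouring high-risk transitions."""
    probs = risk_to_probs(risks)
    return rng.choice(len(probs), size=batch_size, p=probs)


rng = np.random.default_rng(0)
# Hypothetical buffer: mostly low-risk transitions, one near-collision.
risks = np.array([0.0, 0.05, 0.1, 0.9])
probs = risk_to_probs(risks)
batch = sample_indices(risks, batch_size=8, rng=rng)
```

Under this assumed mapping, the rare high-risk transition receives the largest sampling probability, which is one way to counter a buffer dominated by uneventful driving.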

Paper Structure

This paper contains 20 sections, 1 theorem, 16 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

If the following conditions are satisfied: Then there exists a Lyapunov function $V(s) = \mathbb{E}[Q^{\mathrm{safe}}(s,a)] + \lambda D_{\mathrm{KL}}(\pi^{\mathrm{hybrid}} \parallel \pi^{\mathrm{task}})$, such that

Figures (6)

  • Figure 1: An illustration of the proposed methodology. Our method is inspired by weak-to-strong correction and introduces a lightweight safety model to balance performance and safety. It enables the agent to evolve into a safer and more robust entity while preserving its original performance.
  • Figure 2: Overall architecture of the proposed method. Multi-agent safety-critical scenarios are modeled as a dynamic conflict zone. Based on this representation, the DRS-RL algorithm generates hybrid strategies through the weak-to-strong safety correction paradigm. The task and safety models are optimized through the risk-aware PER method and dual-reward collaborative optimization, emphasizing the learning for safety-critical segments.
  • Figure 3: Safety-Enhanced Networked Markov Decision Process.
  • Figure 4: Multi-agent safety-critical scenario set. The first scenario is LVEB, where the AV should perform emergency braking or collaborate with the rear-side vehicle. The second scenario is OPI, where a pedestrian enters the lane from the blind spot. The third scenario is RPC, where AVs are traveling along a lane adjacent to a roadside parking lot, and a parked vehicle suddenly cuts in. The fourth scenario is IJ, where a group of pedestrians crosses the road at the intersection, significantly increasing the collision risk. Additionally, each scenario incorporates random variable parameters (including obstacle position, pedestrian triggering conditions, etc.) to create various variant scenarios.
  • Figure 5: Reward curves of the four multi-agent safety-critical scenarios. The shaded areas show the standard deviation for 5 random seeds.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: Safety-Enhanced Networked MDP
  • Definition 2: Hybrid Policy
  • Theorem 1: Safety Residual Convergence