Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards

Zhaorun Chen; Zhuokai Zhao; Tairan He; Binhao Chen; Xuhao Zhao; Liang Gong; Chengliang Liu

TL;DR

This paper proposes Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm that uses the safety recovery rate as a surrogate chance constraint to iteratively ensure safety both during exploration and after convergence.

Abstract

Ensuring safety in Reinforcement Learning (RL), typically framed as a Constrained Markov Decision Process (CMDP), is crucial for real-world exploration applications. Current approaches to handling CMDPs struggle to balance optimality and feasibility: direct optimization methods cannot ensure state-wise in-training safety, and projection-based methods correct actions inefficiently through lengthy iterations. To address these challenges, we propose Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm that uses the safety recovery rate as a surrogate chance constraint to iteratively ensure safety during exploration and after convergence. Theoretical analysis indicates that the relaxed probabilistic constraint is sufficient to guarantee forward invariance of the safe set. Extensive experiments on both simulated and real-world safety-critical tasks demonstrate its effectiveness in enforcing safety (nearly zero violations) while preserving optimality (+23.8%), robustness, and fast response in stochastic real-world settings.
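To make the notion of a chance constraint concrete, the sketch below checks whether a safety event holds with at least a required probability, estimated from samples. It is a generic illustration rather than the ACS formulation: the function name `satisfies_chance_constraint`, its parameters, and the sample values are all hypothetical, and the paper's surrogate (the safety recovery rate) is not reproduced here.

```python
from typing import Sequence


def satisfies_chance_constraint(
    sampled_costs: Sequence[float], cost_limit: float, min_probability: float
) -> bool:
    """Return True if the empirical P(cost <= cost_limit) is at least min_probability."""
    if not sampled_costs:
        raise ValueError("need at least one sampled cost")
    # Fraction of samples whose cost stays within the limit.
    safe_fraction = sum(c <= cost_limit for c in sampled_costs) / len(sampled_costs)
    return safe_fraction >= min_probability


# Hypothetical sampled costs for one candidate action (illustrative numbers only).
costs = [0.1, 0.2, 0.3, 1.5]
loose_ok = satisfies_chance_constraint(costs, cost_limit=1.0, min_probability=0.7)
strict_ok = satisfies_chance_constraint(costs, cost_limit=1.0, min_probability=0.9)
```

Three of the four samples stay within the limit, so the looser requirement is met and the stricter one is not. A deterministic constraint would instead reject the action because of the single violating sample.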

Paper Structure (22 sections, 2 theorems, 13 equations, 6 figures, 3 tables)

Introduction
Related Work
Safe RL
End-to-end
Direct policy optimization (DPO)
Projection-based methods
Chance-Constrained Safe Control
Preliminaries and Problem Formulation
Markov Decision Process with Safety Constraint
Chance-constrained Safety Probability
Adaptive Chance-Constrained Safeguards
Learning to Recover
Hierarchical Safeguarded Controller
Experiment
Experimental Setup
...and 7 more sections

Key Result

Theorem IV.1

Let $A^{\pi_\theta}_C(x_k, u_k)$ denote the advantage function of control $u_k$ at $x_k$. The sufficient condition that ensures asymptotic safety satisfaction, both during training and after convergence, is [equation missing from this extract], where $H(\mathcal{F}_i(q))$ is the Hessian of $\mathcal{F}_i(q)$, and $C_i$ and $\alpha_i$ denote the cost function and the tolerance level of the $i^\text{th}$ safety constraint, respectively.

Figures (6)

Figure 1: The proposed adaptive chance constraint. The green dashed circle and the red circle denote the current safety cost and the unified cost tolerance level, respectively. The blue oval denotes the adaptive chance-constrained feasible set. Green and red arrows denote feasible and infeasible actions, respectively. When the current cost $V^\pi_C(x_k)$ is within tolerance, the agent is encouraged to explore riskier states. Otherwise, the next action is constrained to a more conservative set that satisfies Eq. (\ref{eqn:sufficient_condition}), so that long-term safety recovery is certified.
Figure 2: The hierarchical framework of the proposed ACS. A Lagrangian-based upper policy layer first generates a near-optimal initial action $u_0$ by solving Eq. (\ref{eqn:lagrangian}). The quasi-Newton-based projection layers then iteratively correct it into the safe set that satisfies Eq. (\ref{eqn:sufficient_condition}) via the efficient back-propagation update in Eq. (\ref{eqn:bfgs_update}). This enables ACS to balance the task objective and certified safety by constraining actions to an adaptive feasible set while ensuring immediate response.
Figure 3: (a)-(d): Four simulated safety-critical tasks on which we assess five safe RL algorithms; (e): An illustration of a safety constraint violation.
Figure 4: The initial and final poses of the robots in the real-world Kuka-Pick and InMoov-Stretch tasks.
Figure 5: In-training curves of episodic return $J_r$ (top row), total cost rate $J_C$ (middle row), and temporal safety cost rate $J_{TC}$ (bottom row) w.r.t. the number of interactions of different algorithms on four safety-critical simulation tasks.
...and 1 more figures
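
The two-stage idea in the Figure 2 caption (an upper layer proposes an initial action, then quasi-Newton steps correct it until a safety condition holds) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `hypothetical_cost` is a toy stand-in for a learned safety-cost estimate, `safeguard`, `limit`, and `margin` are invented names, the safety condition is a plain threshold rather than the paper's sufficient condition, and gradients come from finite differences instead of back-propagation.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def hypothetical_cost(u: Vector) -> float:
    """Toy stand-in for a learned safety-cost estimate (an assumption)."""
    return u[0] ** 2 + 2.0 * u[1] ** 2


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def numeric_grad(f: Callable[[Vector], float], u: Vector, h: float = 1e-6) -> Vector:
    """Central-difference gradient; a real system would back-propagate instead."""
    grad = []
    for i in range(len(u)):
        hi = u[:i] + [u[i] + h] + u[i + 1:]
        lo = u[:i] + [u[i] - h] + u[i + 1:]
        grad.append((f(hi) - f(lo)) / (2.0 * h))
    return grad


def safeguard(
    cost: Callable[[Vector], float],
    u0: Vector,
    limit: float,
    margin: float = 0.1,
    max_iters: int = 50,
) -> Tuple[Vector, bool]:
    """Correct u0 with BFGS steps until cost(u) <= limit; return (action, is_safe)."""
    target = limit - margin  # aim slightly inside the safe set

    def violation(u: Vector) -> float:
        return 0.5 * max(0.0, cost(u) - target) ** 2

    n = len(u0)
    u = list(u0)
    # Inverse-Hessian estimate, initialised to the identity matrix.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = numeric_grad(violation, u)
    for _ in range(max_iters):
        if cost(u) <= limit:
            break
        p = [-dot(row, g) for row in H]  # quasi-Newton search direction
        f0, slope, t = violation(u), dot(g, p), 1.0
        cand = [ui + t * pi for ui, pi in zip(u, p)]
        # Backtracking (Armijo) line search: halve the step until it helps enough.
        while violation(cand) > f0 + 1e-4 * t * slope and t > 1e-12:
            t *= 0.5
            cand = [ui + t * pi for ui, pi in zip(u, p)]
        g_new = numeric_grad(violation, cand)
        s = [c - ui for c, ui in zip(cand, u)]
        y = [a - b for a, b in zip(g_new, g)]
        ys = dot(y, s)
        if ys > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / ys
            Hy = [dot(row, y) for row in H]
            coef = rho * rho * dot(y, Hy) + rho
            # Expanded BFGS inverse-Hessian update (valid for symmetric H).
            H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + coef * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        u, g = cand, g_new
    return u, cost(u) <= limit


# Hypothetical initial action from an upper policy layer, with cost 6.0 > limit.
action, is_safe = safeguard(hypothetical_cost, [2.0, 1.0], limit=1.0)
```

The sketch stops as soon as the threshold is met and does not try to stay close to the initial action, so the corrected action can land well away from `[2.0, 1.0]`. A projection that also preserves the task objective would need an additional proximity term.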

Theorems & Definitions (4)

Theorem IV.1
Theorem VI.1
proof
proof
