Probabilistic Constraint for Safety-Critical Reinforcement Learning

Weiqin Chen; Dharmashankar Subramanian; Santiago Paternain

TL;DR

This work provides an improved gradient, SPG-Actor-Critic, that leads to lower variance than SPG-REINFORCE, as substantiated by theoretical results, and proposes a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies.

Abstract

In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, maintains the trajectory of the agent in a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation that is frequently explored in the existing literature. We provide theoretical bounds showing that the probabilistic-constrained setting offers a better trade-off between optimality and safety (constraint satisfaction). The main challenge in dealing with probabilistic constraints, as explored in this work, is the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression for probabilistic constraints, which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved gradient, SPG-Actor-Critic, that leads to lower variance than SPG-REINFORCE, as substantiated by our theoretical results. A noteworthy aspect of both SPGs is that they are algorithm-independent, so they can be applied across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies. We then provide theoretical analyses covering the convergence of the algorithm, as well as its near-optimality and feasibility on average. In addition, we test the proposed approaches through a series of empirical experiments. These experiments examine the inherent trade-offs between optimality and safety, and substantiate the efficacy of the two SPGs as well as our theoretical contributions.
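To make the primal-dual idea in the abstract concrete, the following is a minimal sketch of one generic Lagrangian primal-dual update. It is not the paper's Safe Primal-Dual algorithm: the function name `primal_dual_step`, its arguments, and the Lagrangian form in the comments are assumptions introduced for illustration. The gradient estimates, which in the paper would come from SPG-REINFORCE or SPG-Actor-Critic, are simply passed in as plain lists. The default step sizes and safety level mirror the values quoted in the figure captions ($\eta_\theta = 0.02$, $\eta_\lambda = 0.002$, $1-\delta = 0.95$).

```python
from typing import List, Tuple


def primal_dual_step(
    theta: List[float],
    lam: float,
    grad_return: List[float],
    grad_safety: List[float],
    safety_prob: float,
    delta: float = 0.05,
    eta_theta: float = 0.02,
    eta_lam: float = 0.002,
) -> Tuple[List[float], float]:
    """One generic primal-dual update (illustrative, hypothetical helper).

    Assumed Lagrangian: L(theta, lam) = V(theta) + lam * (P_safe(theta) - (1 - delta)),
    where V is the expected return and P_safe is the probability of staying safe.
    """
    # Primal step: gradient ascent on the Lagrangian with respect to theta.
    new_theta = [
        t + eta_theta * (g_v + lam * g_s)
        for t, g_v, g_s in zip(theta, grad_return, grad_safety)
    ]
    # Dual step: gradient descent on lam, projected onto lam >= 0.
    # lam grows while the estimated safety probability is below 1 - delta.
    new_lam = max(0.0, lam - eta_lam * (safety_prob - (1.0 - delta)))
    return new_theta, new_lam


if __name__ == "__main__":
    theta, lam = primal_dual_step(
        theta=[0.0, 1.0],
        lam=1.0,
        grad_return=[1.0, 0.0],
        grad_safety=[0.0, 1.0],
        safety_prob=0.90,
    )
    print(theta, lam)
```

In this sketch the dual variable increases whenever the estimated safety probability falls short of $1-\delta$, which puts more weight on the safety gradient in the next primal step.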

Paper Structure (14 sections, 15 theorems, 118 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Problem Formulation
Properties of Safe Reinforcement Learning
Safe Policy Gradients
Learning Safe Policies
Numerical Results
Navigation in a cluttered environment
Lunar Lander
Safety Gym
Conclusions
Appendix
Technical Lemmas for the Proof of Theorem \ref{theorem_P_star_Ptilder_star}
Proof of Theorem \ref{theorem_safe_policy_gradient}
Technical Lemmas used in Section \ref{Losses_of_Imposing_Relaxation}

Key Result

Proposition 1

Denote by $\tilde{\theta}$ a feasible solution to problem \eqref{eqn_problem2_mirror}. Then, $\tilde{\theta}$ is a feasible solution to problem \eqref{eqn_problem1} as well, i.e., the policy induced by $\tilde{\theta}$ guarantees safety in the sense of Definition \ref{definition_safety}.
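The reasoning behind a result of this kind can be sketched with a union bound. The display below is a hedged illustration rather than the paper's proof: the horizon $T$, the safe set $\mathcal{S}_{\text{safe}}$, the state notation $S_t$, and the specific form of the cumulative constraint in the first line are assumptions, since the two problems are not written out in this summary.

```latex
\begin{align*}
\sum_{t=0}^{T} \mathbb{P}\left(S_t \in \mathcal{S}_{\text{safe}} \mid \pi_{\tilde{\theta}}\right) \geq T + 1 - \delta
&\;\Longrightarrow\; \sum_{t=0}^{T} \mathbb{P}\left(S_t \notin \mathcal{S}_{\text{safe}} \mid \pi_{\tilde{\theta}}\right) \leq \delta \\
&\;\Longrightarrow\; \mathbb{P}\left(\bigcup_{t=0}^{T} \{S_t \notin \mathcal{S}_{\text{safe}}\} \mid \pi_{\tilde{\theta}}\right) \leq \delta \\
&\;\Longrightarrow\; \mathbb{P}\left(\bigcap_{t=0}^{T} \{S_t \in \mathcal{S}_{\text{safe}}\} \mid \pi_{\tilde{\theta}}\right) \geq 1 - \delta .
\end{align*}
```

Under these assumptions, the first step uses that each of the $T+1$ per-step probabilities and its complement sum to one, the second is the union bound, and the third takes the complement of the event that some state leaves the safe set.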

Figures (7)

Figure 1: Navigation policy learned after 250,000 episodes of training employing Algorithm \ref{alg_pd} with SPG-REINFORCE and the practical version in Step 4, where $\eta_\theta = 0.02$, $\eta_\lambda = 0.002$, $1-\delta=0.95$. The agent is trained to navigate starting from (1, 1), (1, 9), (2, 5), (8, 9) to the goal (8.5, 1.5).
Figure 2: Implementation of Algorithm \ref{alg_pd} with the practical version in Step 4 for SPG-REINFORCE and SPG-Actor-Critic over 5 independent runs. Return (normalized by time horizon $T$), safety, and dual variable $\lambda$ are averaged over time. The hyper-parameters in both cases are set to $\eta_\theta = 0.02, \eta_\lambda = 0.002, 1-\delta=0.95, T=20$. The number of episodes is set to 250,000. The solid line shows the mean and the shaded area depicts the standard deviation.
Figure 3: In both (a) and (b), the blue dots depict the safety-return for the cumulative-constrained formulation for different values ($\mu\in[40, 250]$ for (a) and $\mu\in[50, 3000]$ for (b)), and the blue triangles depict the corresponding upper bound given by Theorem \ref{theorem_P_star_Ptilder_star}. The red dots depict the safety-return for the probabilistic-constrained formulation for different values ($\lambda\in[10, 50]$ for (a) and $\lambda\in[50, 1500]$ for (b)). In both (a) and (b), the probabilistic-constrained cases trace a better trade-off between evaluation return and safety, and are upper bounded by the blue triangles. Each large dot represents the mean under a fixed $\lambda$ or $\mu$. Each evaluation return (small dot) is averaged over 500 independent evaluations and the return is normalized by time horizon $T$.
Figure 4: Visualization of \emph{LunarLander-v2}. The state space consists of 6 continuous variables (horizontal coordinate $x$, vertical coordinate $y$, horizontal velocity $\dot{x}$, vertical velocity $\dot{y}$, angle $\theta$, angular velocity $\dot{\theta}$) and two Booleans indicating whether the left and right legs are in contact with the ground. The goal position is fixed at (0, 0). Shown in orange are the main and side thrusts and the gravity force configuration, in which $a$ is the offset between the side thrust axes and the lander's center.
Figure 5: Learning of \emph{LunarLander-v2}, averaged over 10 runs for each method. (a) Evolution of the running average return over 100 consecutive landing attempts. The solid lines show the mean and the shaded areas depict the standard deviation. (b) Comparison of safety probability. The safety probability of each run, defined as the probability of $\{v < 0.9\}$ (desired velocity threshold), is computed as the number of safe evaluations divided by the 200 independent evaluations, and the dashed lines represent the mean safety probability over 10 runs. Algorithm \ref{alg_pd} and the classical primal-dual method with $1-\delta=95\%$ are implemented in all experiments for the probabilistic-constrained and cumulative-constrained formulations, respectively.
...and 2 more figures
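The safety probability described in the Figure 5 caption (safe evaluations divided by the number of independent evaluations, then averaged across runs) can be sketched as follows. This is an illustrative sketch only: the helper names `safety_probability` and `mean_over_runs` are hypothetical, and it assumes each evaluation is summarized by a single velocity value $v$, which the caption does not specify.

```python
from typing import List, Sequence


def safety_probability(velocities: Sequence[float], threshold: float = 0.9) -> float:
    """Fraction of evaluations whose velocity v satisfies v < threshold.

    Assumption: one scalar velocity summarizes each evaluation episode.
    """
    if not velocities:
        raise ValueError("at least one evaluation is required")
    safe = sum(1 for v in velocities if v < threshold)
    return safe / len(velocities)


def mean_over_runs(per_run_probs: Sequence[float]) -> float:
    """Mean safety probability across independent training runs."""
    if not per_run_probs:
        raise ValueError("at least one run is required")
    return sum(per_run_probs) / len(per_run_probs)


if __name__ == "__main__":
    # Made-up data: 200 evaluations for one run, 190 of them below the threshold.
    one_run: List[float] = [0.5] * 190 + [1.2] * 10
    print(safety_probability(one_run))
```

With 200 evaluations per run, the per-run estimate has a resolution of 0.005, which is worth keeping in mind when comparing runs against the $1-\delta = 95\%$ target.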

Theorems & Definitions (28)

Definition 1
Proposition 1 (\cite{paternain2022safe}, Theorem 1)
Theorem 1
proof
Corollary 1
proof
Theorem 2
proof
Corollary 2
proof
...and 18 more
