Safe Exploration in Reinforcement Learning: Training Backup Control Barrier Functions with Zero Training Time Safety Violations

Pedram Rabiee, Amirsaeid Safari

TL;DR

The paper tackles safe exploration in reinforcement learning by combining backup control barrier functions with a model-free neural backup policy, ensuring zero safety violations during training. It constructs an initial conservative forward-invariant subset from multiple backup sets and iteratively enlarges this region via a neural backup policy trained with reinforcement learning, all under a safety filter that keeps trajectories within safe bounds. The main contributions include a neural backup policy architecture that guarantees safety from the outset, a softmax/softmin barrier-function framework to fuse multiple backups, and an adjoint-sensitivity-based method to efficiently compute Lie derivatives for real-time control. Empirically, the authors demonstrate on an inverted pendulum that the expanded forward-invariant set enables exploration of a larger state space, yielding improved performance without compromising safety, highlighting the practical impact for safe RL in actuator-constrained systems.
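
The softmax/softmin fusion mentioned above can be illustrated with a short sketch. It assumes the common log-sum-exp forms of the soft minimum and soft maximum; this summary does not reproduce the paper's exact definitions, so the function names, the sharpness parameter `rho`, and the sample barrier values are all hypothetical. Both functions are smooth and never exceed the true min or max, so a positive fused value implies that the true max over backups of the true min along the trajectory is positive as well.

```python
import numpy as np


def _logsumexp(a):
    """Numerically stable log(sum(exp(a))) for a 1-D array."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))


def softmin(z, rho):
    """Smooth lower bound on min(z); it tightens as rho grows."""
    z = np.asarray(z, dtype=float)
    return -_logsumexp(-rho * z) / rho


def softmax(z, rho):
    """Smooth lower bound on max(z); the log(N) term keeps it below max(z)."""
    z = np.asarray(z, dtype=float)
    return (_logsumexp(rho * z) - np.log(z.size)) / rho


# Hypothetical barrier samples at one state (assumed values, not from the paper).
# Each softmin collapses the samples along one backup policy's trajectory;
# the softmax then fuses the per-backup values into a single barrier value.
h_backup_1 = softmin([0.30, 0.10, 0.25], rho=20.0)
h_backup_2 = softmin([-0.20, 0.05, 0.40], rho=20.0)
h_fused = softmax([h_backup_1, h_backup_2], rho=20.0)
```

Here the second backup fails (its soft minimum is negative), but the first succeeds, so the fused value stays positive. A larger `rho` gives a tighter approximation at the cost of larger gradients.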

Abstract

This paper introduces the reinforcement learning backup shield (RLBUS), an algorithm that guarantees safe exploration in reinforcement learning (RL) by incorporating backup control barrier functions (BCBFs). RLBUS constructs an implicit control forward invariant subset of the safe set using multiple backup policies, ensuring safety in the presence of input constraints. While traditional BCBFs often result in conservative control forward-invariant sets due to the design of backup controllers, RLBUS addresses this limitation by leveraging model-free RL to train an additional backup policy, which enlarges the identified control forward invariant subset of the safe set. This approach enables the exploration of larger regions in the state space with zero safety violations during training. The effectiveness of RLBUS is demonstrated on an inverted pendulum example, where the expanded invariant set allows for safe exploration over a broader state space, enhancing performance without compromising safety.

Paper Structure (10 sections, 8 theorems, 29 equations, 3 figures)

This paper contains 10 sections, 8 theorems, 29 equations, 3 figures.

Key Result

Proposition 1

Consider the dynamics eq:dynamics, and suppose Assumptions assum:ub and assum:j_singleton are satisfied. If $x_0 \in \bar{{\mathcal{S}}}_{\mathrm b}$, then, for all $t\ge 0$, $\phi_u(x_0, t)\in \bar{{\mathcal{S}}}_{\mathrm b}$, where $u \equiv u_\theta$.

Figures (3)

  • Figure 1: The schematic illustrates the safe set ${\mathcal{S}}_{\mathrm s}$, backup sets ${\mathcal{S}}_{{\mathrm b}_1}$, ${\mathcal{S}}_{{\mathrm b}_2}$, and forward invariant sets ${\mathcal{S}}_{*1}$, ${\mathcal{S}}_{*2}$ which are the finite-time safe backward images of ${\mathcal{S}}_{{\mathrm b}_1}$, ${\mathcal{S}}_{{\mathrm b}_2}$ under the user-designed backup policies $u_{{\mathrm b}_1}$ and $u_{{\mathrm b}_2}$. The finite-time safe backward image of a set consists of all initial states from which trajectories, under a backup policy, remain in ${\mathcal{S}}_{\mathrm s}$ for a finite horizon and reach the target set within that horizon. The trajectories $\phi_{u_{{\mathrm b}_1}}$ and $\phi_{u_{{\mathrm b}_2}}$ illustrate two such trajectories, corresponding to $u_{{\mathrm b}_1}$ and $u_{{\mathrm b}_2}$, respectively. The neural backup policy $u_\theta$ serves as an additional backup policy (i.e., $u_{{\mathrm b}_3} \equiv u_\theta$), trained using the RLBUS algorithm to expand the finite-time safe backward image of a subset of ${\mathcal{S}}_{{\mathrm b}_1} \cup {\mathcal{S}}_{{\mathrm b}_2}$. The set ${\mathcal{S}}_{*3}$ represents the finite-time safe backward image of a subset of this subset, while $\phi_{u_\theta}$ denotes sample trajectories generated under $u_\theta$. The RLBUS algorithm ensures that $u_\theta$ is trained without any safety violations during training. It initiates exploration from the existing forward invariant subset of the safe set ${\mathcal{S}}_{*1} \cup {\mathcal{S}}_{*2}$ to achieve an expanded forward invariant set ${\mathcal{S}}_{*1} \cup {\mathcal{S}}_{*2} \cup {\mathcal{S}}_{*3}$.
  • Figure 2: Illustration of ${\mathcal{S}}_{\mathrm s}$, ${\mathcal{S}}_{{\mathrm b}_1}$, ${\mathcal{S}}_{{\mathrm b}_2}$, ${\mathcal{S}}_{{\mathrm b}_3}$, ${\mathcal{S}}_{1}$, ${\mathcal{S}}_{2}$, ${\mathcal{S}}_{3}$, and ${\mathcal{R}}$.
  • Figure 3: Safety violations and sampling returns of the performance agent across three scenarios: (i) SAC, (ii) SAC with BCBF, and (iii) SAC with RLBUS.
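
The finite-time safe backward image described in the Figure 1 caption can be approximated numerically by rolling out a backup policy and checking two conditions: the trajectory stays in the safe set over the horizon, and it ends inside the backup set. The sketch below does this for a normalized inverted pendulum. Everything in it is an assumption made for illustration and not taken from the paper: the dynamics, the angle bound, the actuator limit, the saturated linear backup policy, the disc-shaped backup set, the horizon, and the forward-Euler integrator. It also checks membership in the backup set only at the final time, which is one common reading of "within that horizon".

```python
import numpy as np

# Illustrative constants (assumptions, not values from the paper).
THETA_MAX = 1.2       # safe set: |theta| <= THETA_MAX
U_MAX = 1.5           # actuator limit: |u| <= U_MAX
BACKUP_RADIUS = 0.1   # backup set: disc around the upright equilibrium


def h_safe(x):
    """Nonnegative exactly when the angle is within the safe bound."""
    return THETA_MAX**2 - x[0] ** 2


def h_backup(x):
    """Nonnegative exactly when the state is inside the backup disc."""
    return BACKUP_RADIUS**2 - x[0] ** 2 - x[1] ** 2


def backup_policy(x):
    """Hypothetical hand-designed backup: saturated linear feedback."""
    return float(np.clip(-3.0 * x[0] - 3.0 * x[1], -U_MAX, U_MAX))


def dynamics(x, u):
    """Normalized inverted pendulum: theta_ddot = sin(theta) + u."""
    return np.array([x[1], np.sin(x[0]) + u])


def in_backward_image(x0, horizon=5.0, dt=0.01):
    """Roll out the backup policy with forward Euler and test both conditions."""
    x = np.asarray(x0, dtype=float)
    for _ in range(int(round(horizon / dt))):
        if h_safe(x) < 0.0:
            return False  # left the safe set before the horizon ended
        x = x + dt * dynamics(x, backup_policy(x))
    return bool(h_safe(x) >= 0.0 and h_backup(x) >= 0.0)
```

For example, `in_backward_image([0.2, 0.0])` starts near upright with an unsaturated control and settles into the backup disc, whereas `in_backward_image([1.0, 2.0])` starts close to the angle bound with a large velocity that the saturated input cannot arrest in time.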

Theorems & Definitions (10)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Theorem 1
  • Remark 1
  • Lemma 1
  • Lemma 2