Learning safety critics via a non-contractive binary Bellman operator

Agustin Castellano; Hancheng Min; Juan Andrés Bazerque; Enrique Mallada

TL;DR

This work tackles the challenge of guaranteeing safety in RL by introducing a binary safety critic that marks state-action pairs as safe or unsafe. It reframes safety with a non-contractive binary Bellman operator, leading to a binary Bellman equation (B2E) whose fixed points correspond to persistently safe regions known as control invariant safe (CIS) sets, apart from a spurious all-ones solution. The authors prove theoretical structure for the fixed points and present a learning algorithm that leverages axiomatic safe data and a self-consistency loss to avoid incorrect fixed points. Empirical results on an inverted-pendulum task show that the proposed method yields safer, more exploratory policies compared with a risk-based safety critic baseline, demonstrating practical potential for persistent safety in real-world RL. The approach provides a principled route to correct-by-design safety without discounting, with implications for safety-critical applications where avoiding failure is paramount.

Abstract

The inability to naturally enforce safety in Reinforcement Learning (RL), with limited failures, is a core challenge impeding its use in real-world applications. One notion of safety of vast practical relevance is the ability to avoid (unsafe) regions of the state space. Though such a safety goal can be captured by an action-value-like function, a.k.a. safety critics, the associated operator lacks the desired contraction and uniqueness properties that the classical Bellman operator enjoys. In this work, we overcome the non-contractiveness of safety critic operators by leveraging that safety is a binary property. To that end, we study the properties of the binary safety critic associated with a deterministic dynamical system that seeks to avoid reaching an unsafe region. We formulate the corresponding binary Bellman equation (B2E) for safety and study its properties. While the resulting operator is still non-contractive, we fully characterize its fixed points representing--except for a spurious solution--maximal persistently safe regions of the state space that can always avoid failure. We provide an algorithm that, by design, leverages axiomatic knowledge of safe data to avoid spurious fixed points.
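To make the fixed-point discussion concrete, here is a minimal tabular sketch. It is not the authors' learning algorithm, only an illustration under stated assumptions: the six-state chain, its dynamics `F`, and all helper names are hypothetical, and the operator is assumed to take the form $(\mathcal{T}b)(s,a) = i(s) + (1-i(s))\min_{a'} b(F(s,a),a')$, where $i(s)$ indicates membership in the unsafe set. Iterating from the all-zeros table reaches a fixed point that marks exactly the unavoidable failures, while the all-ones table is also left unchanged by the operator, which is the spurious solution mentioned above.

```python
# Hypothetical toy system: states 0..5 on a line; state 5 is the unsafe set G.
N_STATES, ACTIONS = 6, (0, 1)
G = {5}

def F(s, a):
    """Deterministic dynamics. Action 0 moves left, action 1 moves right;
    from state 3 onward the system drifts right whatever the action."""
    if s >= 3:
        return min(s + 1, N_STATES - 1)
    return max(s - 1, 0) if a == 0 else s + 1

def i(s):
    """Indicator of the unsafe set: 1 if s is in G, else 0."""
    return 1 if s in G else 0

def bellman(b):
    """Apply the assumed operator (T b)(s, a) = i(s) + (1 - i(s)) * min_a' b(F(s, a), a')."""
    v = {s: min(b[s, a] for a in ACTIONS) for s in range(N_STATES)}
    return {(s, a): i(s) + (1 - i(s)) * v[F(s, a)]
            for s in range(N_STATES) for a in ACTIONS}

def iterate(b):
    """Apply T repeatedly until the table stops changing."""
    while (nb := bellman(b)) != b:
        b = nb
    return b

zeros = {(s, a): 0 for s in range(N_STATES) for a in ACTIONS}
ones = {(s, a): 1 for s in range(N_STATES) for a in ACTIONS}

b_star = iterate(zeros)
# States with at least one action marked safe (b == 0).
safe_states = sorted(s for s in range(N_STATES)
                     if min(b_star[s, a] for a in ACTIONS) == 0)
print(safe_states)
print(bellman(ones) == ones)  # checks whether all-ones is a fixed point
```

In this toy example, states 3 and 4 are not in $\mathcal{G}$ themselves but are still marked unsafe, because every action from them eventually leads into $\mathcal{G}$. A plain fixed-point check cannot tell the all-ones table apart from a meaningful solution, which is why additional knowledge about safe data is needed.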

Paper Structure (34 sections, 5 theorems, 32 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Contributions of our work
Problem Formulation
Environment
Policies
Relationship between safety and the optimal binary functions
Unsafety as a logical OR
Non-contractive Bellman operator
Closely Related Work
Control-theoretic approaches for computing $\mathcal{S}_{\texttt{safe}}$
Risk-based vs Reachability-based safety critics
To contract or not to contract
Binary characterization of safety
Algorithm
Dataset
...and 19 more sections

Key Result

Proposition 1

For any policy $\pi$, the following set of Bellman equations hold for all $s\in\mathcal{S}$, for all $a\in\mathcal{A}$: $b^\pi(s,a) = i(s) + (1-i(s))v^\pi(s')$, where $s'=F(s,a)$. In particular, any optimal policy satisfies:

Figures (5)

Figure 1: The optimal $b^\star$ describes different regions of the state space. The set $\mathcal{G}$ (solid red) is to be avoided at all times. Due to system dynamics, there is a region of the state space $\mathcal{R}(\mathcal{G})$ (shaded red) such that any trajectory starting there (e.g., from $s_0$) will inevitably enter $\mathcal{G}$. For any point in its complement $\mathcal{S}_{\texttt{safe}}$ (e.g. $s_1$), the optimal policy avoids $\mathcal{G}$ at all times.
Figure 2: An illustration of Theorem 1. Left: a valid fixed point $\tilde{b}$ of $\mathcal{T}$ and its corresponding safe control invariant set. Trajectories starting in $\mathcal{C}$ can be driven to remain in $\mathcal{C}$. Right: a function $\tilde{b}$ that is not a fixed point. A state $s_{\texttt{int}}$ in the intersection will inevitably lead to the unsafe region $\mathcal{G}$, so $\tilde{b}(s,a)$ should be $1$ for all states in the trajectory (which would mean $s_\texttt{int}\notin\mathcal{C}$). Similarly, a state $s_{\texttt{out}}$ outside $\mathcal{C}$ cannot reach inside. If it could, $\tilde{b}(s_{\texttt{out}},a)=1$ for some $a\in\mathcal{A}$, but it would transition to a state where $\min_{a'}\tilde{b}(s',a')=0$, violating the binary Bellman equation.
Figure 3: The custom inverted pendulum environment, with state $s=[\theta, \omega]^\top$. The region past the horizontal $\mathcal{G}$ is to be avoided at all times.
Figure 4: Learned safe regions for the inverted pendulum problem during training. Each panel depicts the learned barrier for a fixed action (maximum clockwise torque, maximum counter-clockwise torque, no torque). The white area corresponds to the states classified as safe (for each of those actions). The solid maroon lines show the boundary of the unsafe region $\mathcal{G}$ (falling past the horizontal). The green region shows the set of states that can avoid $\mathcal{G}$ at all times, and the purple region shows the set of safe states reachable from $\mathcal{D}_{\texttt{safe}}$. These sets were computed using an optimal control toolbox [mitchell2005toolbox].
Figure 5: Left: cumulative failures during training of our algorithm (red) and SBE (blue) for the inverted pendulum. Solid lines represent the means across $5$ seeds; shaded areas are $95\%$ confidence intervals. Our algorithm learns safe policies with fewer failures. Right: safety rate (fraction of safe episodes) and entropy of each learned model. Our algorithm (shaded lines) always uses the uniform safe policy. SBE is tested for different threshold values $\eta$. Our policy achieves an almost perfect safety rate and is exploratory (high entropy). Only the most conservative SBE policies (large $\eta$) are $100\%$ safe, but they have low entropy (limited exploration).
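
The "uniform safe policy" and the entropy reported in Figure 5 can be illustrated with a short sketch. The tabular critic `b_demo`, the state and action labels, and the helper names below are hypothetical, and entropy is assumed here to be the Shannon entropy (in nats) of the action distribution in a single state; the paper's learned critic and its exact entropy measure may differ.

```python
import math
import random

def safe_actions(b, s, actions):
    """Actions the binary critic marks as safe (b == 0) in state s."""
    return [a for a in actions if b[s, a] == 0]

def uniform_safe_policy(b, s, actions, rng=random):
    """Sample uniformly among safe actions; return None if none is marked safe."""
    candidates = safe_actions(b, s, actions)
    return rng.choice(candidates) if candidates else None

def policy_entropy(b, s, actions):
    """Entropy (in nats) of the uniform distribution over the safe actions in s."""
    n = len(safe_actions(b, s, actions))
    return math.log(n) if n > 0 else 0.0

# Hypothetical critic table: 0 = safe, 1 = unsafe.
ACTIONS = ("cw", "ccw", "none")
b_demo = {
    ("upright", "cw"): 0, ("upright", "ccw"): 0, ("upright", "none"): 0,
    ("tilted", "cw"): 1, ("tilted", "ccw"): 0, ("tilted", "none"): 1,
}

print(uniform_safe_policy(b_demo, "tilted", ACTIONS))
print(policy_entropy(b_demo, "upright", ACTIONS))
```

Under these assumptions the policy stays as random as the critic allows: entropy is highest where every action is marked safe and drops to zero where only one safe action remains.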

Theorems & Definitions (9)

Definition 1: $t$-step reachable sets
Definition 2: Binary safety value functions
Definition 3: Optimal binary value functions
Proposition 1: Binary Bellman Equations
Definition 4: Control invariant safe (CIS) set
Theorem 1: Fixed points and control invariant safe sets
Corollary 1: Maximality of the CIS set
Proposition 2: Binary Bellman equation for $v^\pi$
Lemma 1
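
As a rough illustration of the control invariant safe (CIS) notion listed above, the sketch below checks a candidate set on a finite deterministic system. The definition it uses is an assumption inferred from the Figure 2 caption (the set contains no unsafe state, and from every state in it some action keeps the successor inside it); the toy dynamics `F`, the unsafe set `G`, and the function name are hypothetical.

```python
def is_control_invariant_safe(C, G, F, actions):
    """Check the assumed CIS conditions for a finite set of states C:
    (1) C contains no unsafe state, and
    (2) every state in C has at least one action whose successor stays in C."""
    C = set(C)
    if C & set(G):
        return False
    return all(any(F(s, a) in C for a in actions) for s in C)

def F(s, a):
    """Hypothetical dynamics on states 0..5: action 0 moves left, action 1 moves
    right, and from state 3 onward the system drifts right whatever the action."""
    if s >= 3:
        return min(s + 1, 5)
    return max(s - 1, 0) if a == 0 else s + 1

G = {5}          # unsafe set
ACTIONS = (0, 1)

print(is_control_invariant_safe({0, 1, 2}, G, F, ACTIONS))
print(is_control_invariant_safe({0, 1, 2, 3}, G, F, ACTIONS))
```

In this toy system the set `{0, 1, 2}` passes the check, while adding state 3 breaks it because state 3 is forced out of the set regardless of the action taken.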
