Learning safety critics via a non-contractive binary Bellman operator
Agustin Castellano, Hancheng Min, Juan Andrés Bazerque, Enrique Mallada
TL;DR
This work tackles the challenge of guaranteeing safety in reinforcement learning (RL) by introducing a binary safety critic that marks state-action pairs as safe or unsafe. It reframes safety with a non-contractive binary Bellman operator, leading to a binary Bellman equation (B2E) whose fixed points, apart from a spurious all-ones solution, correspond to persistently safe regions known as control invariant safe (CIS) sets. The authors characterize the structure of these fixed points and present a learning algorithm that uses axiomatically safe data and a self-consistency loss to avoid incorrect fixed points. Empirical results on an inverted-pendulum task show that the proposed method yields safer, more exploratory policies than a risk-based safety critic baseline. The approach offers a principled, correct-by-design route to safety without discounting, which matters for safety-critical applications where avoiding failure is paramount.
Abstract
The inability to naturally enforce safety in Reinforcement Learning (RL), with limited failures, is a core challenge impeding its use in real-world applications. One notion of safety of vast practical relevance is the ability to avoid (unsafe) regions of the state space. Though such a safety goal can be captured by action-value-like functions, a.k.a. safety critics, the associated operator lacks the contraction and uniqueness properties that the classical Bellman operator enjoys. In this work, we overcome the non-contractiveness of safety critic operators by leveraging the fact that safety is a binary property. To that end, we study the properties of the binary safety critic associated with a deterministic dynamical system that seeks to avoid reaching an unsafe region. We formulate the corresponding binary Bellman equation (B2E) for safety and study its properties. While the resulting operator is still non-contractive, we fully characterize its fixed points, which, except for a spurious solution, represent maximal persistently safe regions of the state space that can always avoid failure. We provide an algorithm that, by design, leverages axiomatic knowledge of safe data to avoid spurious fixed points.
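To make the idea of multiple fixed points concrete, the following is a minimal sketch on a hypothetical five-state deterministic system that does not come from the paper. It assumes one plausible form of the binary Bellman operator: a state-action pair is flagged unsafe (1) if its successor is unsafe, and otherwise inherits the best (minimum) flag available at the successor. The transition table `NEXT`, the `UNSAFE` mask, and all function names are illustrative assumptions.

```python
# Toy deterministic system (hypothetical, not taken from the paper).
# NEXT[s][a] is the successor of state s under action a.
NEXT = [
    [0, 1],  # state 0: stay, or move to 1
    [0, 2],  # state 1: go back to 0, or move to 2
    [3, 3],  # state 2: both actions lead to 3
    [4, 4],  # state 3: both actions lead to the unsafe state
    [4, 4],  # state 4: unsafe and absorbing
]
UNSAFE = [False, False, False, False, True]


def binary_bellman(b):
    """Apply an assumed form of the binary Bellman operator once.

    b[s][a] == 1 flags the pair (s, a) as unsafe, 0 as safe.
    """
    new_b = []
    for s in range(len(NEXT)):
        row = []
        for a in range(len(NEXT[s])):
            s_next = NEXT[s][a]
            if UNSAFE[s_next]:
                # Successor is unsafe: the pair is unsafe.
                row.append(1)
            else:
                # Otherwise the pair is as safe as the best action at s_next.
                row.append(min(b[s_next]))
        new_b.append(row)
    return new_b


def iterate_to_fixed_point(b, max_iters=100):
    """Apply the operator repeatedly until the table stops changing."""
    for _ in range(max_iters):
        new_b = binary_bellman(b)
        if new_b == b:
            return b
        b = new_b
    raise RuntimeError("no fixed point reached within max_iters")


all_zeros = [[0, 0] for _ in NEXT]
all_ones = [[1, 1] for _ in NEXT]

b_safe = iterate_to_fixed_point(all_zeros)     # optimistic initialization
b_spurious = iterate_to_fixed_point(all_ones)  # pessimistic initialization

print(b_safe)
print(b_spurious)
```

In this toy system the two initializations settle on different tables, so the operator cannot be a contraction. The all-ones table satisfies the equation while carrying no information about which states are actually safe, which is the kind of spurious solution the abstract refers to.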
