Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning
Sean Vaskov, Wilko Schwarting, Chris L. Baker
TL;DR
This paper tackles safe reinforcement learning by introducing counterfactual constraint formulations that penalize only the harm a learner causes relative to a safe default policy, using viability theory to relate initial states, uncertainty, and safety. It proposes two main mechanisms, clipped CCATE and counterfactual Harm, both estimated online within a PPO framework using TD($\lambda$)-style max operators to capture infinite-horizon safety effects. The methods are implemented with separate critics and counterfactual inference (N-step lookahead) for state-wise safety. They are evaluated on a rover with uncertain friction and on a tractor-trailer parking task, where Harm-based constraints yield higher safety recall and lower harm than traditional baselines. The results suggest that counterfactual safety constraints improve robustness and safety in RL, at some additional computational cost, and could be extended to shielding and hierarchical control frameworks.
Abstract
Reinforcement Learning (RL) for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment. When safety constraints must be considered, constrained optimization approaches, in which agents are penalized for constraint violations, are commonly used. In such methods, if agents are initialized in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized. We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy. In a philosophical sense, this formulation penalizes the learner only for constraint violations that it caused; in a practical sense, it maintains feasibility of the optimal control problem. We present simulation studies on a rover with uncertain road friction and a tractor-trailer parking environment that demonstrate that our constraint formulation enables agents to learn safer policies than contemporary constrained RL methods.
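To make the idea of penalizing only learner-caused violations concrete, the following is a minimal sketch and not the paper's formulation. It assumes that each rollout yields a sequence of non-negative constraint costs, that a trajectory is scored by its worst cost, and that harm is the positive part of the learner's score minus the default policy's score. The function names `worst_case_violation` and `counterfactual_harm` are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def worst_case_violation(costs: Sequence[float]) -> float:
    """Largest constraint cost along one trajectory (0.0 if it is empty)."""
    return max(costs, default=0.0)


def counterfactual_harm(
    learner_costs: Sequence[float],
    default_costs: Sequence[float],
) -> float:
    """Illustrative harm: how much worse the learner did than the default.

    Assumption (not from the paper): harm is the positive part of the
    difference in worst-case constraint cost between the two rollouts,
    both started from the same initial state.
    """
    gap = worst_case_violation(learner_costs) - worst_case_violation(default_costs)
    return max(0.0, gap)


# Violation is inevitable from this state: both policies violate, so no harm.
inevitable = counterfactual_harm([0.0, 1.0], [0.0, 1.0])

# Only the learner violates: the full violation counts as harm.
caused = counterfactual_harm([0.0, 1.0], [0.0, 0.0])

# The learner is safer than the default: harm is clipped at zero.
safer = counterfactual_harm([0.0, 0.0], [0.0, 1.0])
```

Under these assumptions, a state from which even the default policy violates the constraint contributes no penalty, which is the intuition behind keeping the constrained problem feasible.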
