Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning

Sean Vaskov; Wilko Schwarting; Chris L. Baker

TL;DR

This paper tackles safe reinforcement learning by introducing counterfactual constraint formulations that penalize only the harm a learner causes relative to a safe default policy, using viability theory to relate initial states, uncertainty, and safety. It proposes two main mechanisms, clipped CCATE and counterfactual Harm, both estimated online within a PPO framework using TD($\lambda$)-style max operators to capture infinite-horizon safety effects. The methods are implemented with separate critics and counterfactual inference (N-step lookahead) for state-wise safety, and are evaluated on a rover with uncertain friction and on a tractor-trailer parking task, where Harm-based constraints yield higher safety recall and lower harm than traditional baselines. The results suggest that counterfactual safety constraints improve robustness and safety in RL, at some additional computational cost, and that they could be extended to shielding and hierarchical control frameworks.

Abstract

Reinforcement Learning (RL) for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment. When considering safety constraints, constrained optimization approaches, where agents are penalized for constraint violations, are commonly used. In such methods, if agents are initialized in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized. We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy. In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem. We present simulation studies on a rover with uncertain road friction and a tractor-trailer parking environment that demonstrate our constraint formulation enables agents to learn safer policies than contemporary constrained RL methods.
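To make the idea of penalizing only the violations the learner caused more concrete, here is a minimal sketch. It is not the paper's formulation. It assumes a hypothetical binary violation indicator per episode, paired rollouts of the learned policy and the default policy $\mu$ from the same initial state under the same realized uncertainty, and harm defined as the positive part of their difference. The function name `counterfactual_harm` and the sample outcomes are illustrative assumptions.

```python
def counterfactual_harm(learner_violation: int, default_violation: int) -> int:
    """Illustrative harm for one paired rollout (assumed definition).

    Returns 1 only when the learner violated the constraint and the
    default policy would not have; otherwise returns 0.
    """
    return max(0, learner_violation - default_violation)


# Hypothetical paired outcomes: (learner violated?, default policy violated?)
paired_outcomes = [
    (0, 0),  # both safe: no harm
    (1, 1),  # violation occurs even under the default: not attributed to the learner
    (1, 0),  # learner violates where the default stays safe: harm
    (0, 1),  # learner avoids a violation the default incurs: no harm
]

harms = [counterfactual_harm(l, d) for l, d in paired_outcomes]

# A plain violation count charges the learner for every violation it experiences.
violation_rate = sum(l for l, _ in paired_outcomes) / len(paired_outcomes)

# The assumed harm measure charges only violations the default would have avoided.
harm_rate = sum(harms) / len(harms)
```

In this toy data a plain violation penalty would charge the learner for two of the four episodes, while the assumed harm measure charges only the single episode in which the default policy would have stayed safe.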

Paper Structure (17 sections, 3 theorems, 26 equations, 7 figures, 1 table)

Introduction
Preliminaries
Formulations
Clipped Conditional Average Treatment Effect
Counterfactual Harm
Learning Implementation
Experiments
Rover
Tractor-Trailer Parking
Discussion
Conclusion
Extensions to Shielding and Hierarchical Methods
TD($\lambda$) Estimate for Max Operator
Extended Results
Training Hyperparameters
...and 2 more sections

Key Result

lemma 1

Given $x, y, z \in \mathbb{R}$, $|\max(x,y)-\max(x,z)| \leq |y-z|$.
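The lemma says that taking a max against a fixed value $x$ never widens the gap between $y$ and $z$. The following sketch spot-checks the inequality on random samples. It is a numerical sanity check only, not a proof and not the authors' argument. The helper names `max_gap` and `lemma_holds`, the sampling range, and the tolerance are assumptions chosen for illustration.

```python
import random


def max_gap(x: float, y: float, z: float) -> float:
    """Left-hand side of the lemma: |max(x, y) - max(x, z)|."""
    return abs(max(x, y) - max(x, z))


def lemma_holds(x: float, y: float, z: float, tol: float = 1e-12) -> bool:
    """Check |max(x, y) - max(x, z)| <= |y - z| up to a small tolerance."""
    return max_gap(x, y, z) <= abs(y - z) + tol


# Fixed seed so the spot check is reproducible.
rng = random.Random(0)
samples = [
    (rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(-10, 10))
    for _ in range(10_000)
]
all_hold = all(lemma_holds(x, y, z) for x, y, z in samples)
```

Two hand-checkable cases: with $x=0$, $y=1$, $z=-1$ the left side is $1$ and the right side is $2$; with $x=5$, $y=1$, $z=2$ both maxima equal $5$, so the left side is $0$.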

Figures (7)

Figure 1: Viability statistics for rover
Figure 2: CC (red) and HARM_C (blue) policies
Figure 3: Viability statistics for tractor-trailer
Figure 4: CC_0 (red) and HARM (blue) policies
Figure 5: Cumulative distribution of harm (left) and constraint violations (right) for tractor-trailer. The black dashed lines are generated by executing the default policy, $\mu$, from the initial states.
...and 2 more figures

Theorems & Definitions (8)

remark 1
remark 2
lemma 1
proof
theorem 3
proof
theorem 4
proof
