Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning

Sean Vaskov, Wilko Schwarting, Chris L. Baker

TL;DR

This paper tackles safe reinforcement learning by introducing counterfactual constraint formulations that penalize only the harm a learner causes relative to a safe default policy, using viability theory to relate initial states, uncertainty, and safety. It proposes two main mechanisms, clipped CCATE and counterfactual Harm, both estimated online within a PPO framework using TD($\lambda$)-style max operators to capture infinite-horizon safety effects. The methods are implemented with separate critics and counterfactual inference (N-step lookahead) for state-wise safety, and evaluated on a rover with uncertain friction and a tractor-trailer parking task, where Harm-based constraints yield higher safety recall and lower harm than traditional baselines. The results suggest that counterfactual safety constraints improve robustness and safety in RL, with practical trade-offs in computation and potential for extension to shielding and hierarchical control frameworks.

Abstract

Reinforcement Learning (RL) for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment. When considering safety constraints, constrained optimization approaches, where agents are penalized for constraint violations, are commonly used. In such methods, if agents are initialized in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized. We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy. In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem. We present simulation studies on a rover with uncertain road friction and a tractor-trailer parking environment that demonstrate our constraint formulation enables agents to learn safer policies than contemporary constrained RL methods.

Paper Structure (17 sections, 3 theorems, 26 equations, 7 figures, 1 table)

Key Result

lemma 1

Given $x, y, z \in \mathbb{R}$, $|\max(x,y) - \max(x,z)| \leq |y - z|$.
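
The inequality in lemma 1 says that taking a max against a fixed value $x$ never amplifies the gap between $y$ and $z$. A minimal numerical sanity check is sketched below; it is not the paper's proof, and the names `max_gap` and `check_lemma`, the sampling range, and the floating-point tolerance are all assumptions made for illustration.

```python
import random


def max_gap(x: float, y: float, z: float) -> float:
    """Left-hand side of lemma 1: |max(x, y) - max(x, z)|."""
    return abs(max(x, y) - max(x, z))


def check_lemma(num_trials: int = 10_000, seed: int = 0, scale: float = 100.0) -> bool:
    """Return True if no sampled triple (x, y, z) violates the inequality.

    Hypothetical helper: draws triples uniformly from [-scale, scale] and
    compares both sides, allowing a small tolerance for rounding error.
    """
    rng = random.Random(seed)
    for _ in range(num_trials):
        x, y, z = (rng.uniform(-scale, scale) for _ in range(3))
        if max_gap(x, y, z) > abs(y - z) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print("lemma 1 held on all sampled triples:", check_lemma())
```

The three cases behind the bound appear directly: when $x$ dominates both $y$ and $z$ the left-hand side is zero, when $x$ lies between them the gap is truncated, and when $x$ is below both the two sides are equal.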

Figures (7)

  • Figure 1: Viability Statistics for rover
  • Figure 2: CC (red) and HARM_C (blue) policies
  • Figure 3: Viability statistics for tractor-trailer
  • Figure 4: CC_0 (red) and HARM (blue) policies
  • Figure 5: Cumulative distribution of harm (left) and constraint violations (right) for tractor-trailer. The black dashed lines are generated by executing the default policy, $\mu$, from the initial states.
  • ...and 2 more figures

Theorems & Definitions (8)

  • remark 1
  • remark 2
  • lemma 1
  • proof
  • theorem 3
  • proof
  • theorem 4
  • proof