Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning

Huy Hoang; Tien Mai; Pradeep Varakantham

TL;DR

The paper tackles constrained reinforcement learning, where each trajectory $\tau$ has a return $R(\tau)$ and a cost $C(\tau)$, and where per-state surrogates of the trajectory-level cost constraint tend to over- or underestimate it. It proposes Self-Imitation Learning (SIM), which labels a trajectory as 'good' when $R(\tau) \ge R_G$ and $C(\tau) \le c_{max}$ and as 'bad' when $R(\tau) < R_B$ or $C(\tau) > c_{max}$, then learns by imitating the former while avoiding the latter. SIM replaces cost-function surrogates with a non-adversarial distribution-matching objective that combines good and bad demonstrations via a mixed occupancy $\rho^{G,\pi}$ and a posterior ratio $K(s,a)$, maximizing $\mathbb{E}_{(s,a)\sim \rho^{G,\pi}}[\log \frac{K(s,a)}{1-K(s,a)}]$; $K$ is updated alongside the policy (e.g., with PPO). Theoretical results show that with $\lambda>0$ the approach improves reward while respecting the cost constraint. Empirical results on SafetyGym, in CVaR settings, and in unknown-cost scenarios show SIM outperforming state-of-the-art constrained RL baselines and remaining robust across starting policies. The work advances safe RL by enabling learning from evolving demonstrations without cost-function estimation, which matters in domains where costs are noisy or unavailable.
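As a rough illustration of the objective quoted above, the term $\log \frac{K(s,a)}{1-K(s,a)}$ is the log-odds (logit) of $K$, and the expectation can be approximated by a sample mean over state-action pairs. The sketch below assumes the values of $K$ are already available as plain numbers in $(0,1)$; how $K$ is estimated is not described in this summary, and the function names `log_odds` and `estimate_objective` are hypothetical.

```python
import math
from typing import Sequence


def log_odds(k: float) -> float:
    """Return log(k / (1 - k)) for a value k strictly between 0 and 1."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    # log1p(-k) computes log(1 - k) accurately when k is close to 0.
    return math.log(k) - math.log1p(-k)


def estimate_objective(k_values: Sequence[float]) -> float:
    """Sample-mean estimate of E[log(K / (1 - K))].

    Assumption: k_values holds K(s, a) evaluated on state-action pairs
    sampled from the mixed occupancy; the sampling itself is not shown.
    """
    if not k_values:
        raise ValueError("need at least one sample")
    return sum(log_odds(k) for k in k_values) / len(k_values)
```

Values of $K$ above 0.5 contribute positively and values below 0.5 contribute negatively, so the estimate grows as more sampled pairs receive a high $K$.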

Abstract

A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory-based constraints on expected cost (or other cost measures) are employed to enforce safety and, more importantly, are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback of such approaches is that they over- or underestimate the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories and avoids "bad" trajectories generated from incrementally improving policies. We employ an oracle that uses a reward threshold (which is varied during learning) and the overall cost constraint to label trajectories as "good" or "bad". A key advantage of our approach is that it can start from any policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach outperforms top benchmark approaches for solving Constrained RL problems with respect to expected cost, CVaR cost, or even unknown cost constraints.
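The labeling oracle can be sketched in a few lines using the thresholds given in the TL;DR ($R_G$, $R_B$, $c_{max}$). This is a minimal sketch rather than the authors' implementation: the `Trajectory` class, the function name `label_trajectory`, and the choice to return `None` for trajectories that are neither good nor bad are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical container for the two trajectory-level totals."""
    total_reward: float  # R(tau)
    total_cost: float    # C(tau)


def label_trajectory(
    traj: Trajectory, r_good: float, r_bad: float, c_max: float
) -> Optional[str]:
    """Label a trajectory as "good", "bad", or None (unlabeled).

    good: R(tau) >= r_good and C(tau) <= c_max
    bad:  R(tau) <  r_bad  or  C(tau) >  c_max
    The "good" test runs first, so it wins if r_bad > r_good makes both hold.
    """
    if traj.total_reward >= r_good and traj.total_cost <= c_max:
        return "good"
    if traj.total_reward < r_bad or traj.total_cost > c_max:
        return "bad"
    # Feasible cost, but reward falls between r_bad and r_good.
    return None
```

For example, with `r_good=8`, `r_bad=5` and `c_max=2`, a trajectory with reward 10 and cost 1 is labeled good, while the same reward with cost 3 is labeled bad because it violates the cost limit. Since the reward threshold is varied during learning, the thresholds are passed as arguments rather than fixed.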

Paper Structure (43 sections, 7 theorems, 30 equations, 60 figures, 3 tables, 1 algorithm)

Introduction
Background
Constrained Markov Decision Process
Imitation Learning
Behavioral Cloning
Distribution Matching
Self-Imitation Learning Approach
Learning from Good and Bad Demonstrations
Theoretical Insights
Example
Self-Imitation based Safe RL
Experiments
SIM vs other Constrained RL methods on SafetyGym
SIM vs GAIL, and the Importance of "Good" and "Bad" Demonstrations
SIM vs Behavioral Cloning
...and 28 more sections

Key Result

Lemma 1

For any $\lambda>0$, if there exists a policy $\pi^*$ such that $P_{\pi^*}(\tau) = 0$ for all $\tau \in \Omega^B$, and $P_{\pi^*}(\tau) = \frac{P_{\pi^0}(\tau)}{ \sum_{\tau'\in \Omega^G}P_{\pi^0}(\tau')};~\forall \tau \in \Omega^G$ then $\pi^*$ is an optimal policy to BC-good-bad.
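The trajectory distribution described in Lemma 1 can be illustrated on a toy, finite set of trajectories: bad trajectories receive probability zero, and good trajectories keep their probability under $\pi^0$, renormalized over $\Omega^G$. The sketch below only computes that target distribution. It does not solve the BC-good-bad problem, which is not defined in this excerpt, and the function name `lemma1_target` and the dictionary representation are assumptions.

```python
from typing import Dict, Hashable, Set


def lemma1_target(
    p0: Dict[Hashable, float], good: Set[Hashable]
) -> Dict[Hashable, float]:
    """Renormalize the probabilities of good trajectories and zero out the rest.

    p0:   trajectory -> probability under the starting policy pi^0
    good: the set Omega^G of trajectories labeled good
    Trajectories outside `good` (including all bad ones) get probability 0,
    because the renormalized good probabilities already sum to 1.
    """
    good_mass = sum(p0[t] for t in good)
    if good_mass <= 0.0:
        raise ValueError("pi^0 must place positive probability on Omega^G")
    return {t: (p / good_mass if t in good else 0.0) for t, p in p0.items()}
```

With `p0 = {"a": 0.2, "b": 0.3, "c": 0.5}` and `good = {"a", "b"}`, the good mass is 0.5, so the target places probability 0.4 on `a`, 0.6 on `b`, and 0 on `c`.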

Figures (60)

Figure 1: Example
Figure 6: Example
Figure 7: Overview of SIM
Figure 8: Although a significant number of trajectories do not satisfy the constraints (red lines), the relaxed-constraint setting is still able to offer a considerable number of good trajectories (green lines).
Figure : SafetyPointGoal
...and 55 more figures

Theorems & Definitions (15)

Definition 1
Lemma 1
Proposition 1
Lemma 2
Proposition 2
Proposition 3
Proposition 4
proof
Lemma 3
proof
...and 5 more
