Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning
Huy Hoang, Tien Mai, Pradeep Varakantham
TL;DR
The paper tackles constrained reinforcement learning in settings where the trajectory return $R(\tau)$ and trajectory cost $C(\tau)$ are hard to estimate. It proposes Self-Imitation Learning (SIM), which labels a trajectory as 'good' when $R(\tau) \ge R_G$ and $C(\tau) \le c_{max}$ and as 'bad' when $R(\tau) < R_B$ or $C(\tau) > c_{max}$, then learns by imitating the former while avoiding the latter. SIM replaces cost-function surrogates with a non-adversarial distribution-matching objective that combines good and bad demonstrations through a mixed occupancy $\rho^{G,\pi}$ and a posterior ratio $K(s,a)$; the objective maximizes $\mathbb{E}_{(s,a)\sim \rho^{G,\pi}}[\log \frac{K(s,a)}{1-K(s,a)}]$ and is updated alongside the policy (e.g., with PPO). Theoretical results show that with $\lambda>0$ the approach improves reward while respecting the cost constraint. Empirical results on SafetyGym, CVaR settings, and unknown-cost scenarios show SIM outperforming state-of-the-art constrained RL baselines and remaining robust across starting policies. The work advances safe RL by enabling learning from evolving demonstrations without cost-function estimation, which matters in practice in domains where costs are noisy or unavailable.
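The labeling rule stated above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `label_trajectory` and its parameter names are hypothetical, and the sketch assumes $R_B \le R_G$ and leaves unlabeled any trajectory that satisfies neither rule.

```python
from typing import Optional


def label_trajectory(
    ret: float,     # trajectory return R(tau)
    cost: float,    # trajectory cost C(tau)
    r_good: float,  # reward threshold R_G for "good"
    r_bad: float,   # reward threshold R_B for "bad" (assumed <= r_good)
    c_max: float,   # cost budget c_max
) -> Optional[str]:
    """Return "good", "bad", or None for an unlabeled trajectory."""
    # Good: high enough return AND within the cost budget.
    if ret >= r_good and cost <= c_max:
        return "good"
    # Bad: return too low OR cost budget violated.
    if ret < r_bad or cost > c_max:
        return "bad"
    # Feasible but with return in [r_bad, r_good): neither rule applies.
    # Leaving it unlabeled is an assumption of this sketch.
    return None
```

For example, with `r_good=8.0`, `r_bad=3.0`, and `c_max=2.0`, a trajectory with return 10 and cost 1 is labeled good, while the same return with cost 5 is labeled bad because it violates the cost budget.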
Abstract
A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory-based constraints on expected cost (or other cost measures) are employed to enforce safety and, more importantly, are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback of such approaches is an over- or underestimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories and avoids "bad" trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as "good" or "bad". A key advantage of our approach is that we are able to work from any starting policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach is able to outperform top benchmark approaches for solving Constrained RL problems, with respect to expected cost, CVaR cost, or even unknown cost constraints.
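For reference, the expected-cost version of the constrained problem described above is commonly written as follows. This is the standard Constrained RL formulation expressed in the TL;DR's notation ($R(\tau)$, $C(\tau)$, $c_{max}$), given here as an assumption about the setup rather than quoted from the paper:

$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right] \quad \text{s.t.} \quad \mathbb{E}_{\tau \sim \pi}\left[C(\tau)\right] \le c_{max}$$

The CVaR variant mentioned in the abstract would replace the expectation in the constraint with a CVaR measure of $C(\tau)$.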
