Counterexample-Guided Repair of Reinforcement Learning Systems Using Safety Critics
David Boetius, Stefan Leue
TL;DR
The paper addresses unsafe behavior in deep reinforcement learning by proposing a counterexample-guided repair framework in which learnable safety critics quantify safety and guide the repair. It formalizes safe RL as a constrained Markov decision process (CMDP) and introduces safety critics that predict safety margins from state trajectories, which makes it possible to remove counterexamples with gradient-based constrained optimization. The method alternates between identifying counterexamples and removing them; the safety critic is repaired alongside the agent so that previously found counterexamples are not forgotten. The paper also discusses termination of the repair loop and practical mitigations. Overall, the work aims to enable safe RL without costly full retraining or abstraction-based methods; empirical evaluation and comparison with existing safe-RL approaches are left to future work.
Abstract
Naively trained Deep Reinforcement Learning agents may fail to satisfy vital safety constraints. To avoid costly retraining, we may desire to repair a previously trained reinforcement learning agent to obviate unsafe behaviour. We devise a counterexample-guided repair algorithm for repairing reinforcement learning systems leveraging safety critics. The algorithm jointly repairs a reinforcement learning agent and a safety critic using gradient-based constrained optimisation.
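The alternation between finding and removing counterexamples can be illustrated with a minimal Python sketch. It is a toy under strong assumptions and not the authors' algorithm: the "agent" is a single hypothetical gain `k` of a linear policy on a one-step scalar system, the safety margin is computed exactly instead of being predicted by a learned safety critic, the falsifier is a grid search, and the removal step is plain gradient ascent towards feasibility rather than a full constrained optimisation. All function names and constants are illustrative.

```python
def step(x, k):
    """Toy one-step dynamics under the linear policy a = -k * x (assumed)."""
    return x + (-k * x)


def safety_margin(x0, k):
    """Non-negative iff the successor state satisfies |x| <= 1.

    Stand-in for a learned safety critic: here the margin is exact.
    """
    return 1.0 - abs(step(x0, k))


def find_counterexample(k, lo=-2.0, hi=2.0, n=81):
    """Grid-search falsifier: worst violating initial state, or None."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    worst = min(grid, key=lambda x0: safety_margin(x0, k))
    return worst if safety_margin(worst, k) < 0.0 else None


def remove_counterexamples(k, counterexamples, target=0.05, lr=0.05,
                           max_steps=1000):
    """Gradient ascent on the worst stored margin until all stored
    counterexamples reach at least `target` (or max_steps is hit)."""
    eps = 1e-6  # finite differences stand in for automatic differentiation
    for _ in range(max_steps):
        worst = min(counterexamples, key=lambda x0: safety_margin(x0, k))
        if safety_margin(worst, k) >= target:
            break
        grad = (safety_margin(worst, k + eps)
                - safety_margin(worst, k - eps)) / (2 * eps)
        k += lr * grad
    return k


def repair(k, max_rounds=20):
    """Alternate between falsification and counterexample removal."""
    counterexamples = []  # kept across rounds so earlier ones stay repaired
    for _ in range(max_rounds):
        cx = find_counterexample(k)
        if cx is None:
            return k, counterexamples
        counterexamples.append(cx)
        k = remove_counterexamples(k, counterexamples)
    raise RuntimeError("no safe policy found within max_rounds")


repaired_k, found = repair(k=0.2)
print(f"repaired gain: {repaired_k:.3f}, counterexamples used: {found}")
```

Keeping every counterexample in the list and re-checking all of them in each removal step mirrors, in a very reduced form, the idea of not forgetting earlier counterexamples. The `max_rounds` bound is a crude guard of this sketch only and says nothing about the termination behaviour of the actual method.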
