Counterexample-Guided Repair of Reinforcement Learning Systems Using Safety Critics

David Boetius, Stefan Leue

TL;DR

The paper addresses unsafe behavior in deep reinforcement learning by proposing a counterexample-guided repair framework that uses learnable safety critics to quantify safety and guide repair. It formalizes safe RL within a CMDP and introduces safety critics that predict safety margins from state trajectories, enabling gradient-based constrained optimization to remove counterexamples. The method alternates between identifying counterexamples and removing them, while simultaneously repairing the safety critic to avoid forgetting previous counterexamples, and it discusses termination concerns and practical mitigations. Overall, the work aims to enable safe RL without costly full retraining or abstraction-based methods, with future work focusing on empirical evaluation and comparisons to existing safe-RL approaches.

Abstract

Naively trained Deep Reinforcement Learning agents may fail to satisfy vital safety constraints. To avoid costly retraining, we may desire to repair a previously trained reinforcement learning agent to obviate unsafe behaviour. We devise a counterexample-guided repair algorithm for repairing reinforcement learning systems leveraging safety critics. The algorithm jointly repairs a reinforcement learning agent and a safety critic using gradient-based constrained optimisation.
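
The alternation between finding counterexamples and removing them can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a hypothetical one-parameter policy, replaces the learned safety critic with a directly computable safety margin, uses a grid search in place of a verifier, and ignores the return objective entirely. The names `safety_margin`, `find_counterexample`, `remove_counterexamples`, and `repair` are illustrative only.

```python
def safety_margin(theta, s):
    # Hypothetical satisfaction function: a non-negative value means the
    # policy with parameter theta is safe when started in state s.
    return 1.0 - theta * s


def find_counterexample(theta, states):
    # Stand-in for a verifier: pick the state with the worst margin and
    # report it only if it actually violates the safety constraint.
    worst = min(states, key=lambda s: safety_margin(theta, s))
    return worst if safety_margin(theta, worst) < 0 else None


def remove_counterexamples(theta, counterexamples, lr=0.1, max_steps=1000):
    # Gradient steps that reduce the violation on ALL counterexamples
    # collected so far, so earlier ones are not forgotten.
    for _ in range(max_steps):
        violated = [s for s in counterexamples if safety_margin(theta, s) < 0]
        if not violated:
            break
        # d/dtheta of (-margin) is s, summed over the violated states.
        theta -= lr * sum(violated)
    return theta


def repair(theta, states, max_iters=50):
    # Alternate between searching for a counterexample and removing it.
    counterexamples = []
    for _ in range(max_iters):
        cx = find_counterexample(theta, states)
        if cx is None:
            return theta, counterexamples
        counterexamples.append(cx)
        theta = remove_counterexamples(theta, counterexamples)
    raise RuntimeError("no safe parameter found within the iteration budget")


if __name__ == "__main__":
    states = [i / 10 for i in range(11)]
    theta, found = repair(3.0, states)
    print("repaired theta:", theta, "counterexamples used:", found)
```

The explicit iteration budget in `repair` reflects the termination concerns mentioned in the summary: the loop is not guaranteed to stop on its own, so the sketch gives up after a fixed number of rounds.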

Paper Structure

This paper contains 12 sections, 1 theorem, 6 equations, and 2 algorithms.

Key Result

Proposition

A policy $\pi_{\boldsymbol{\theta}}$ is safe whenever a verifier does not produce a counterexample for $\pi_{\boldsymbol{\theta}}$.

Theorems & Definitions (8)

  • Definition: CMDP
  • Definition: Return
  • Definition: Safe Trajectories
  • Definition: Satisfaction Function
  • Definition: Counterexample
  • Definition: Soundness and Completeness
  • Proposition
  • Proof