Enhancing RL Safety with Counterfactual LLM Reasoning
Dennis Gross, Helge Spieker
TL;DR
The paper tackles unsafe and opaque behavior in reinforcement learning through post-hoc safety repair, using counterfactual reasoning from large language models guided by probabilistic model checking. It builds the discrete-time Markov chain (DTMC) induced by the environment and the policy, uses the Storm model checker to quantify safety against PCTL properties (giving a safety measure $m$), and identifies unsafe state–action situations. For each such situation, an LLM provides an explanation and a safer alternative action, after which the repaired DTMC is re-verified to yield an updated safety measure $m'$. Compared with a baseline that selects the second-best action, the approach yields explainable repairs that improve the policy's measured safety, offering a practical route to safer and more interpretable RL policies after training.
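The verify, repair, re-verify loop described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy MDP, its probabilities, the one-step flagging threshold, and the `suggest_alternative` stub (a hypothetical stand-in for the LLM call) are all assumptions made for illustration, and a simple fixed-point iteration replaces Storm for computing the probability of eventually reaching an unsafe state.

```python
# Toy MDP (invented numbers): TRANSITIONS[state][action] -> [(probability, next_state)].
# States GOAL and UNSAFE are absorbing and therefore have no outgoing entries here.
TRANSITIONS = {
    0: {"a": [(0.5, 1), (0.5, 2)]},
    1: {"risky": [(0.8, 3), (0.2, 2)], "safe": [(0.1, 3), (0.9, 2)]},
}
GOAL, UNSAFE = 2, 3


def induced_dtmc(policy):
    """Fix one action per state, which turns the MDP into a DTMC."""
    return {s: TRANSITIONS[s][a] for s, a in policy.items()}


def unsafe_reach_probability(dtmc, start=0, tol=1e-12, max_iters=10_000):
    """Probability of eventually reaching UNSAFE from `start`.

    Plain fixed-point iteration; a stand-in for a model checker such as Storm.
    """
    x = {s: 0.0 for s in dtmc}
    x[GOAL], x[UNSAFE] = 0.0, 1.0
    for _ in range(max_iters):
        delta = 0.0
        for s, row in dtmc.items():
            new = sum(p * x[t] for p, t in row)
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < tol:
            break
    return x[start]


def flag_unsafe_pairs(policy, threshold=0.5):
    """Assumed criterion: one-step probability of entering UNSAFE exceeds `threshold`."""
    return [
        (s, a)
        for s, a in policy.items()
        if sum(p for p, t in TRANSITIONS[s][a] if t == UNSAFE) > threshold
    ]


def suggest_alternative(state, action):
    """Hypothetical stand-in for the LLM: returns (explanation, safer_action) or None."""
    canned = {
        (1, "risky"): ("'risky' enters the unsafe state with probability 0.8.", "safe"),
    }
    return canned.get((state, action))


def repair(policy):
    """Replace each flagged action with the suggested alternative, if one exists."""
    repaired = dict(policy)
    for state, action in flag_unsafe_pairs(policy):
        suggestion = suggest_alternative(state, action)
        if suggestion is not None:
            _explanation, safer_action = suggestion
            repaired[state] = safer_action
    return repaired


policy = {0: "a", 1: "risky"}
m = unsafe_reach_probability(induced_dtmc(policy))
repaired_policy = repair(policy)
m_prime = unsafe_reach_probability(induced_dtmc(repaired_policy))
print(f"m = {m:.2f}, m' = {m_prime:.2f}")
```

On this toy model the script prints `m = 0.40, m' = 0.05`: swapping the single flagged action lowers the probability of reaching the unsafe state. In a real setting the reachability query would be handed to a probabilistic model checker, and the canned lookup would be replaced by an actual LLM prompt.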
Abstract
Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance the safety of RL policies after training. We show that our approach improves policy safety and helps to explain it.
