Enhancing RL Safety with Counterfactual LLM Reasoning

Dennis Gross; Helge Spieker

TL;DR

The paper tackles unsafe and opaque behavior in reinforcement learning (RL) through post-hoc safety repair, using counterfactual reasoning from large language models (LLMs) guided by probabilistic model checking. It builds the discrete-time Markov chain (DTMC) induced by the environment and the policy, uses the Storm model checker to quantify safety against PCTL properties, and identifies unsafe state–action situations. For each such situation, an LLM provides an explanation and a safer alternative action, after which the DTMC is re-verified to yield an updated safety measure $m'$. Compared with a baseline that selects the second-best action, the approach yields explainable safety repairs that improve the policy's safety performance, demonstrating a practical pathway to safer RL after training with enhanced interpretability.
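The verify, repair, re-verify loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy MDP, the policy, and the state and action names are hypothetical; a plain fixed-point iteration stands in for the Storm model checker; and `suggest_alternative` is a hard-coded stub where the paper queries an LLM. The safety measure here is the probability of eventually reaching the `crash` state.

```python
from typing import Dict, List, Tuple

Transitions = List[Tuple[str, float]]  # (next_state, probability)

# Hypothetical toy MDP (not from the paper): state -> action -> transitions.
MDP: Dict[str, Dict[str, Transitions]] = {
    "s0": {"fast": [("s1", 0.9), ("crash", 0.1)], "slow": [("s1", 1.0)]},
    "s1": {"fast": [("goal", 0.8), ("crash", 0.2)],
           "slow": [("goal", 0.95), ("crash", 0.05)]},
    "crash": {"stay": [("crash", 1.0)]},
    "goal": {"stay": [("goal", 1.0)]},
}

# Hypothetical trained policy: one action per state.
POLICY = {"s0": "fast", "s1": "fast", "crash": "stay", "goal": "stay"}


def induce_dtmc(mdp, policy) -> Dict[str, Transitions]:
    """Fix the policy's action in every state, which leaves a Markov chain."""
    return {s: mdp[s][policy[s]] for s in mdp}


def reach_probability(dtmc, start: str, target: str, iters: int = 1000) -> float:
    """Probability of eventually reaching `target`, by fixed-point iteration.

    Stand-in for a model checker query; not how Storm is actually invoked.
    """
    x = {s: 1.0 if s == target else 0.0 for s in dtmc}
    for _ in range(iters):
        x = {s: 1.0 if s == target else sum(p * x[t] for t, p in dtmc[s])
             for s in dtmc}
    return x[start]


def unsafe_pairs(dtmc, policy, unsafe: str) -> List[Tuple[str, str]]:
    """State-action pairs whose chosen action can lead directly to `unsafe`."""
    return [(s, policy[s]) for s in dtmc
            if s != unsafe and any(t == unsafe and p > 0 for t, p in dtmc[s])]


def suggest_alternative(state: str, action: str) -> str:
    """Stub standing in for the LLM query; always answers 'slow'."""
    return "slow"


dtmc = induce_dtmc(MDP, POLICY)
m = reach_probability(dtmc, "s0", "crash")  # safety measure before repair

repaired = dict(POLICY)
for state, action in unsafe_pairs(dtmc, POLICY, "crash"):
    repaired[state] = suggest_alternative(state, action)

# Re-verify the repaired policy to obtain the updated measure m'.
m_prime = reach_probability(induce_dtmc(MDP, repaired), "s0", "crash")
print(f"m = {m:.3f}, m' = {m_prime:.3f}")
```

In this toy setting the repaired policy lowers the probability of reaching `crash`. The sketch only shows the shape of the loop; it says nothing about how well an actual LLM would choose alternative actions.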

Abstract

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves RL policy safety and helps to explain it.

Paper Structure (6 sections)

This paper contains 6 sections.

Introduction
Related work
Background
Probabilistic model checking
Large language models
Methodology

Theorems & Definitions (1)

Definition: Markov decision process (MDP)
