Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

Dennis Gross; Helge Spieker

Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

Dennis Gross, Helge Spieker

TL;DR

This work introduces a method to verify stochastic reinforcement learning (RL) policies, leveraging a Markov decision process, a trained RL policy, and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently verified via the model checker Storm.

Abstract

We introduce a method to verify stochastic reinforcement learning (RL) policies. This approach is compatible with any RL algorithm as long as the algorithm and its corresponding environment collectively adhere to the Markov property. In this setting, the future state of the environment should depend solely on its current state and the action executed, independent of any previous states or actions. Our method integrates a verification technique, referred to as model checking, with RL, leveraging a Markov decision process, a trained RL policy, and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently verified via the model checker Storm. We demonstrate our method's applicability across multiple benchmarks, comparing it to baseline methods called deterministic safety estimates and naive monolithic model checking. Our results show that our method is suited to verify stochastic RL policies.

Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

TL;DR

Abstract

Paper Structure (31 sections, 2 equations, 4 figures, 2 tables)

This paper contains 31 sections, 2 equations, 4 figures, 2 tables.

INTRODUCTION
RL problem.
Hard to verify.
Approach.
RELATED WORK
Background
Probabilistic Model Checking
Reinforcement Learning
Methodology
Running example.
Induced MDP Construction
Induced DTMC Transformation
Safety Verification
Limitations
Experiments
...and 16 more sections

Figures (4)

Figure 1: This diagram represents an RL system in which an agent interacts with an environment. The agent receives a state and a reward from the environment based on its previous action. The agent then uses this information to select the next action, which it sends to the environment.
Figure 2: The MDP $M$ with $S = \{x=1,x=2,x=3,x=4\}$, A={UP,NOP,DOWN}, and $rew \colon S \times A \rightarrow \{1\}$.
Figure 3: Induced MDP $\hat{M}$. We add the state-action transition to the MDP for each action with a probability greater than zero $\hat{\pi}(a|s)$. In this example, the chosen actions are UP and DOWN with the corresponding probabilities $\hat{\pi}(UP|x=1) = 0.3$ and $\hat{\pi}(DOWN|x=1) = 0.7$ at state $x=1$.
Figure 4: Transitions update for $\hat{D}$. For example, $\hat{Tr}_{\hat{D}}(x=1)(x=2) = Tr(x=1,UP,x=2) \cdot \hat{\pi}(UP\mid x=1) = 0.2 \cdot 0.3 = 0.06$ at state $x=1$. Repeat with step from Figure \ref{['fig:combined_actions']} for all reachable states by the trained RL policy.

Theorems & Definitions (6)

Definition 1: Markov Decision Process
Definition 2: Discrete-time Markov Chain
Definition 3: Deterministic Policy
Definition 4: Stochastic Policy
Definition 5: Permissive Policy
Definition 6: PCTL Syntax

Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

TL;DR

Abstract

Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

Authors

TL;DR

Abstract

Table of Contents

Figures (4)

Theorems & Definitions (6)