Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Abhijit Mazumdar; Rafal Wisniewski; Manuela L. Bujorianu

TL;DR

The paper addresses learning optimal policies for constrained Markov decision processes (CMDPs) with a stochastic stopping time while keeping the safety constraint satisfied throughout learning. It develops a model-free, LP-based $p$-Safe RL framework that combines optimism under uncertainty, a safe baseline policy, and a proxy set to enable safe exploration; the learned policy is safe with high confidence and approaches the safe optimum as more information is gathered. A formal analysis links the safety function $S^P_{c0^{x_0}}(x)$ to stopping-time dynamics and occupation measures, and the method is demonstrated on an illustrative example MDP, where objective and constraint regret are reported. The paper also outlines extensions, including regret analysis and function approximation for scalability, aimed at safety-critical domains where stopping times are stochastic.
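
For orientation, LP-based methods for constrained MDPs are commonly written over occupation measures. The program below is the textbook form for a CMDP that stops on reaching a terminal state; it is a sketch under stated assumptions and not necessarily the paper's exact formulation. The symbols are assumed here: $\mathcal{X}'$ is the set of non-terminal states, $\rho(x,a)$ is the expected number of visits to $(x,a)$ before stopping, $c$ is the stage cost, $\mu$ is the initial distribution, $U$ is the unsafe set, and $p$ is the safety level.

$$
\begin{aligned}
\min_{\rho \ge 0} \quad & \sum_{x \in \mathcal{X}'} \sum_{a} c(x,a)\, \rho(x,a) \\
\text{s.t.} \quad & \sum_{a} \rho(y,a) - \sum_{x \in \mathcal{X}'} \sum_{a} P(y \mid x,a)\, \rho(x,a) = \mu(y), \qquad y \in \mathcal{X}', \\
& \sum_{x \in \mathcal{X}'} \sum_{a} P(U \mid x,a)\, \rho(x,a) \le p .
\end{aligned}
$$

A policy can be read off a feasible $\rho$ as $\pi(a \mid x) = \rho(x,a) / \sum_{a'} \rho(x,a')$ wherever the denominator is positive. In a learning setting the transition kernel $P$ is unknown, so it would presumably be replaced by estimates with confidence bounds.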

Abstract

In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Although the topic has received attention from the scientific community, the problem of learning an optimal policy under a stochastic stopping time without violating safety constraints during the learning phase has yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central to developing algorithms that do not violate the safety constraints. Finally, we provide simulation results to show the efficacy of the proposed algorithm. Further, we demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.
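
To make the safety notion concrete, here is a minimal sketch that evaluates a fixed policy on a toy chain. It assumes that $p$-safety of a state means the probability of reaching an unsafe state before the target is at most $p$; the function names `hitting_probability` and `is_p_safe`, the four-state chain, and all numbers are hypothetical and do not come from the paper. Unlike the paper's model-free setting, the sketch takes the transition probabilities as known.

```python
# Illustrative sketch only: names, chain, and numbers are assumptions,
# not taken from the paper. Transition probabilities are assumed known.

def hitting_probability(P, unsafe, target, tol=1e-12, max_iter=100_000):
    """Probability of reaching `unsafe` before `target` from each state.

    P[x] maps successor states to transition probabilities under a
    fixed policy. Unsafe and target states are treated as absorbing.
    """
    s = {x: (1.0 if x in unsafe else 0.0) for x in P}
    for _ in range(max_iter):
        delta = 0.0
        for x in P:
            if x in unsafe or x in target:
                continue  # boundary values stay fixed at 1 and 0
            new = sum(prob * s[y] for y, prob in P[x].items())
            delta = max(delta, abs(new - s[x]))
            s[x] = new
        if delta < tol:
            break
    return s


def is_p_safe(P, unsafe, target, x0, p):
    """Check the assumed p-safety condition for initial state x0."""
    return hitting_probability(P, unsafe, target)[x0] <= p


# Symmetric random walk on {0, 1, 2, 3}: 0 is unsafe, 3 is the target.
chain = {
    0: {0: 1.0},
    1: {0: 0.5, 2: 0.5},
    2: {1: 0.5, 3: 0.5},
    3: {3: 1.0},
}
safety = hitting_probability(chain, unsafe={0}, target={3})
```

For this chain the safety values solve $s_1 = 0.5 + 0.5\,s_2$ and $s_2 = 0.5\,s_1$, giving $s_1 = 2/3$ and $s_2 = 1/3$, so state 2 is $0.4$-safe while state 1 is not.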

Paper Structure (8 sections, 7 theorems, 36 equations, 3 figures, 1 algorithm)

Introduction
Background and Problem Formulation
Safe Policy Learning
Probabilistic Safety
Safe Policy Design
$p$-Safe Learning Algorithm (Algorithm 1)
Illustrating Example
Conclusion and Future Work

Key Result

Lemma 1

Suppose, for a given policy $\pi$ and safety parameter vector $d$, at instant $t$, the safety constraint $\mathscr{P}_t\leq d$ is satisfied. Then, with probability $1$, an unsafe state will be eventually visited.

Figures (3)

Figure 1: Example MDP
Figure 2: Evolution of Per-episode Objective and Constraint Regret
Figure 3: Objective Regret with and without knowledge of the proxy states

Theorems & Definitions (17)

Definition 1: $p$-safety
Definition 2: Proxy Set
Remark 1
Definition 3: Safe Action
Lemma 1
proof
Lemma 2
proof
Lemma 3
Remark 2
...and 7 more
