Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu

TL;DR

The paper tackles learning optimal policies for constrained MDPs with a stochastic stopping time while maintaining safety throughout the learning phase. It develops a model-free, LP-based $p$-Safe RL framework that uses optimism under uncertainty, a safe baseline policy, and a proxy set to enable safe exploration; the learned policy is safe with high confidence and approaches the safe optimum as information accrues. A formal analysis links the safety function $S^P_{c0^{x_0}}(x)$ to stopping-time dynamics and occupation measures, and the method is validated on an illustrative example MDP, where it shows favorable regret while keeping the safety constraint satisfied. The work outlines practical extensions, including regret analysis and function approximation for scalability, which are relevant to safe exploration in safety-critical domains where stopping times are stochastic.

Abstract

In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Despite the attention it has received from the scientific community, the problem of learning an optimal policy under a stochastic stopping time, without violating safety constraints during the learning phase, has yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central to developing algorithms that do not violate the safety constraints. Finally, we provide simulation results to show the efficacy of the proposed algorithm. Further, we demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.
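
To make the notion of a safety constraint under a stochastic stopping time more concrete, the following minimal sketch computes, for a fixed policy, the probability of visiting an unsafe state before the process stops. It is an illustration only, not the authors' algorithm: it assumes that $p$-safety means this probability is at most $p$, and the four-state chain, the state names, and the function `reach_probability` are all hypothetical.

```python
def reach_probability(transitions, unsafe, terminal, tol=1e-12, max_iter=10_000):
    """Probability of hitting an unsafe state before a terminal one.

    transitions: dict mapping each transient state to {next_state: probability},
    i.e. the Markov chain induced by one fixed policy (assumed known here).
    """
    # Collect every state that appears as a source or a target.
    states = set(transitions) | set(unsafe) | set(terminal)
    for row in transitions.values():
        states |= set(row)

    # Boundary conditions: 1 on unsafe states, 0 on terminal (stopping) states.
    s = {x: (1.0 if x in unsafe else 0.0) for x in states}

    # Fixed-point iteration on s(x) = sum_y P(x, y) * s(y) for transient x.
    for _ in range(max_iter):
        delta = 0.0
        for x, row in transitions.items():
            if x in unsafe or x in terminal:
                continue
            new = sum(prob * s[y] for y, prob in row.items())
            delta = max(delta, abs(new - s[x]))
            s[x] = new
        if delta < tol:
            break
    return s


# Hypothetical chain induced by some fixed policy (not taken from the paper).
chain = {
    "a": {"b": 0.5, "goal": 0.4, "unsafe": 0.1},
    "b": {"a": 0.5, "goal": 0.5},
}
safety = reach_probability(chain, unsafe={"unsafe"}, terminal={"goal"})

p = 0.2  # assumed safety level
is_p_safe = safety["a"] <= p
```

In this toy chain the equations $s(a) = 0.5\,s(b) + 0.1$ and $s(b) = 0.5\,s(a)$ give $s(a) = 2/15 \approx 0.133$, so the policy would count as $p$-safe from state `a` for $p = 0.2$ under the assumed definition. A model-free method such as the one summarized above cannot evaluate this quantity exactly, since the transition probabilities are unknown and have to be estimated from data.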

Paper Structure

This paper contains 8 sections, 7 theorems, 36 equations, 3 figures, 1 algorithm.

Key Result

Lemma 1

Suppose that, for a given policy $\pi$ and safety parameter vector $d$, the safety constraint $\mathscr{P}_t\leq d$ is satisfied at instant $t$. Then, with probability $1$, an unsafe state will eventually be visited.

Figures (3)

  • Figure 1: Example MDP
  • Figure 2: Evolution of Per-episode Objective and Constraint Regret
  • Figure 3: Objective Regret with and without knowledge of the proxy states

Theorems & Definitions (17)

  • Definition 1: $p$-safety
  • Definition 2: Proxy Set
  • Remark 1
  • Definition 3: Safe Action
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • Remark 2
  • ...and 7 more