Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time
Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu
TL;DR
The paper addresses learning optimal policies for constrained MDPs with a stochastic stopping time while maintaining safety throughout the learning phase. It develops a model-free, LP-based $p$-safe RL framework that combines optimism under uncertainty, a safe baseline policy, and a proxy set to enable safe exploration; the approach guarantees safety with high confidence and converges to the safe optimum as information accrues. A formal analysis links the safety function $S^P_{c0^{x_0}}(x)$ to stopping-time dynamics and occupation measures, and the method is validated on an illustrative MDP, where it shows favorable regret while remaining safe. The work also outlines practical extensions, including regret analysis and function approximation for scalability, which are relevant to safe exploration in safety-critical domains where stopping times are stochastic.
Abstract
In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Although safe reinforcement learning has received considerable attention from the scientific community, the problem of learning an optimal policy under a stochastic stopping time without violating safety constraints during the learning phase has yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central to developing algorithms that do not violate the safety constraints. We further demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set. Finally, we provide simulation results to show the efficacy of the proposed algorithm.
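To make the notion of a safety constraint under a stochastic stopping time concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that the safety function of a fixed policy is the probability of entering a forbidden state before the process stops at a target state, and that a policy is $p$-safe from an initial state when this probability is at most $p$. The five-state chain, the function names, and the dictionary layout of the transition probabilities are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Assumption: the safety function of a fixed policy
# is the probability of hitting a forbidden state before the process stops
# at a target state. All names and the toy chain below are hypothetical.

def safety_function(P, unsafe, target, tol=1e-12, max_iter=100_000):
    """Return S[x] = P(enter `unsafe` before `target` | start in x).

    P[x] maps next_state -> probability under a fixed policy.
    """
    states = list(P)
    # Boundary conditions: 1 on forbidden states, 0 on target states.
    S = {x: (1.0 if x in unsafe else 0.0) for x in states}
    for _ in range(max_iter):
        delta = 0.0
        for x in states:
            if x in unsafe or x in target:
                continue  # boundary values stay fixed
            new = sum(prob * S[y] for y, prob in P[x].items())
            delta = max(delta, abs(new - S[x]))
            S[x] = new  # in-place (Gauss-Seidel style) update
        if delta < tol:
            break
    return S


def is_p_safe(S, x0, p):
    """Check the assumed p-safety condition S(x0) <= p."""
    return S[x0] <= p


def chain(q):
    """Five-state chain: move right with probability q, left otherwise.

    State 0 is forbidden and state 4 is the target; both are absorbing.
    """
    P = {0: {0: 1.0}, 4: {4: 1.0}}
    for x in (1, 2, 3):
        P[x] = {x + 1: q, x - 1: 1.0 - q}
    return P


S = safety_function(chain(0.5), unsafe={0}, target={4})
print({x: round(v, 6) for x, v in sorted(S.items())})
print(is_p_safe(S, 2, 0.6), is_p_safe(S, 1, 0.6))
```

For the symmetric chain this reduces to the classical gambler's-ruin problem, so the safety function is linear in the start state. Here the transition probabilities are known; in the model-free setting described above they would have to be estimated from data, which is where confidence bounds and a safe baseline policy become necessary.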
