Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds

Qian Zuo, Fengxiang He

TL;DR

We design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. To the best of our knowledge, it is the first reinforcement learning algorithm with theoretically guaranteed performance in an uncertain environment where even the thresholds are unknown.

Abstract

This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments. We use a Growing-Window estimator, which samples from interactions with the uncertain environment, to estimate the thresholds. Based on this estimator, we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of $\tilde{\mathcal{O}}(\sqrt{T})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{T})$ constraint violation over $T$ episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed, known thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm with theoretically guaranteed performance in an uncertain environment where even the thresholds are unknown.

Paper Structure

This paper contains 60 sections, 27 theorems, 126 equations, 1 figure, 1 table, 3 algorithms.

Key Result

Theorem 1

Given a confidence parameter $\delta \in (0,1)$, with probability at least $1 - \delta$, the following holds for every constraint $i \in [m]$, step $h\in [H]$, episode $t \in [T]$, and state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$: where $\zeta_h^{(W_t),t}(s,a)=\min\left\{1,\sqrt{\frac{4\ln(mSAHT/\delta)}{\max\{1,N_h^{(W_t),t-1}(s,a)\}}}\right\}$.
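
The confidence radius $\zeta_h^{(W_t),t}(s,a)$ in Theorem 1 can be evaluated directly from its closed form. The following is a minimal sketch, not the authors' implementation: the function name `confidence_radius` and its argument names are hypothetical, and `visit_count` is assumed to stand for $N_h^{(W_t),t-1}(s,a)$, read here as the number of visits to $(s,a)$ at step $h$ recorded before episode $t$.

```python
import math


def confidence_radius(visit_count, m, S, A, H, T, delta):
    """Evaluate min{1, sqrt(4 ln(mSAHT/delta) / max{1, N})}.

    visit_count is assumed to play the role of N_h^{(W_t),t-1}(s,a);
    m, S, A, H, T are the numbers of constraints, states, actions,
    steps per episode, and episodes; delta is the confidence parameter.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in the open interval (0, 1)")
    # Logarithmic factor from the union bound over constraints,
    # states, actions, steps, and episodes.
    log_term = math.log(m * S * A * H * T / delta)
    # max{1, N} guards against division by zero for unvisited pairs.
    effective_count = max(1, visit_count)
    return min(1.0, math.sqrt(4.0 * log_term / effective_count))


if __name__ == "__main__":
    # Illustrative problem sizes only; they are not taken from the paper.
    for n in (0, 10, 100, 1000, 10000):
        radius = confidence_radius(n, m=2, S=5, A=3, H=10, T=1000, delta=0.1)
        print(n, round(radius, 4))
```

The radius is clipped at 1 for rarely visited pairs and shrinks at rate $1/\sqrt{N}$ once the visit count exceeds $4\ln(mSAHT/\delta)$.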

Figures (1)

  • Figure 1: Feasible Sets and their Corresponding Feasible Solutions

Theorems & Definitions (47)

  • Theorem 1: Asymptotic consistency of Growing-Window estimator to thresholds
  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 1: Strong duality under pessimistic thresholds
  • Lemma 2: Strong duality under optimistic thresholds
  • Theorem 2: Bounds for reward regret and constraint violation of pessimistic policy
  • Remark 4
  • Remark 5
  • Lemma 3: Policy optimality of pessimistic policy
  • ...and 37 more