A safe exploration approach to constrained Markov decision processes

Tingting Ni, Maryam Kamgarpour

TL;DR

Unlike existing CMDP approaches that ensure policy feasibility only upon convergence, the LB-SGD algorithm guarantees feasibility throughout the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$.

Abstract

We consider discounted infinite-horizon constrained Markov decision processes (CMDPs), where the goal is to find an optimal policy that maximizes the expected cumulative reward while satisfying expected cumulative constraints. Motivated by the application of CMDPs in online learning for safety-critical systems, we focus on developing a model-free and \emph{simulator-free} algorithm that ensures \emph{constraint satisfaction during learning}. To this end, we employ the LB-SGD algorithm proposed in \cite{usmanova2022log}, which utilizes an interior-point approach based on the log-barrier function of the CMDP. Under the commonly assumed conditions of relaxed Fisher non-degeneracy and bounded transfer error in policy parameterization, we establish the theoretical properties of the LB-SGD algorithm. In particular, unlike existing CMDP approaches that ensure policy feasibility only upon convergence, the LB-SGD algorithm guarantees feasibility throughout the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. Compared to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA \cite{bai2022achieving2}, the LB-SGD algorithm requires an additional $\mathcal{O}(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with the same Fisher non-degenerate parameterization.
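The interior-point idea in the abstract can be illustrated on a toy problem. The sketch below is not the LB-SGD algorithm of \cite{usmanova2022log}: it uses exact gradients instead of stochastic estimates, and it replaces LB-SGD's stepsize rule with a simple step-halving safeguard. The one-dimensional objective, the constraint written as $g(x)\ge 0$, the barrier weight `eta`, and all function names are assumptions made for illustration. It only shows how ascending on $f(x)+\eta\log g(x)$ from a strictly feasible point keeps every iterate feasible.

```python
import numpy as np

def log_barrier_ascent(f_grad, g, g_grad, x0, eta=0.1, step=0.02, iters=200):
    """Gradient ascent on f(x) + eta * log(g(x)), keeping g(x) > 0 at every iterate.

    Illustrative only: exact gradients and step halving stand in for the
    stochastic estimates and the stepsize rule used by LB-SGD.
    """
    x = np.asarray(x0, dtype=float)
    assert g(x) > 0, "the starting point must be strictly feasible"
    trajectory = [x.copy()]
    for _ in range(iters):
        # Gradient of the log-barrier surrogate f(x) + eta * log(g(x)).
        direction = f_grad(x) + eta * g_grad(x) / g(x)
        t = step
        # Halve the step until the next iterate stays strictly feasible.
        while g(x + t * direction) <= 0:
            t *= 0.5
        x = x + t * direction
        trajectory.append(x.copy())
    return x, trajectory

# Hypothetical toy problem: maximize f(x) = -(x - 2)^2 subject to g(x) = 1 - x >= 0.
f_grad = lambda x: -2.0 * (x - 2.0)
g = lambda x: float(1.0 - x[0])
g_grad = lambda x: np.array([-1.0])

x_final, traj = log_barrier_ascent(f_grad, g, g_grad, x0=[0.0])
print("final iterate:", x_final[0], "constraint value:", g(x_final))
```

The unconstrained maximizer $x=2$ is infeasible, so the iterates approach the boundary $x=1$ from inside and settle at the stationary point of the barrier surrogate, slightly short of the boundary. A smaller `eta` moves that point closer to the constrained optimum.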

Paper Structure

This paper contains 38 sections, 21 theorems, 131 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3.3

Let Assumption smoli hold. The following properties hold $\forall i\in\{0,\dots,m\}$ and $\forall \theta\in\Theta$.

  1. $V_i^\theta(\rho)$ are $L$-Lipschitz continuous and $M$-smooth, where $L:=\frac{M_g}{(1-\gamma)^2}$ and $M:=\frac{M_g^2+M_h}{(1-\gamma)^2}$.
  2. Let $b^0(H):=\frac{\gamma^{H}}{1-\gamma}$ …
  3. Let $\sigma^0(n) := \frac{\sqrt{2}}{\sqrt{n}(1-\gamma)}$ and $\sigma^1(n) := \frac{2\sqrt{2}M_g}{\dots}$ …
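To get a feel for how these constants scale with the discount factor, the fully stated formulas from the proposition can be evaluated numerically. The sketch below covers only $L$, $M$, and $\sigma^0(n)$, since the remaining quantities are cut off in this excerpt. The values chosen for $M_g$, $M_h$, $\gamma$, and $n$ are hypothetical, and the function names are not from the paper.

```python
import math

def lipschitz_and_smoothness(M_g, M_h, gamma):
    """Return (L, M) with L = M_g / (1 - gamma)^2 and M = (M_g^2 + M_h) / (1 - gamma)^2."""
    denom = (1.0 - gamma) ** 2
    return M_g / denom, (M_g ** 2 + M_h) / denom

def sigma0(n, gamma):
    """Return sigma^0(n) = sqrt(2) / (sqrt(n) * (1 - gamma))."""
    return math.sqrt(2.0) / (math.sqrt(n) * (1.0 - gamma))

# Hypothetical values, chosen only to illustrate the scaling.
L, M = lipschitz_and_smoothness(M_g=1.0, M_h=1.0, gamma=0.9)
print("L =", L, "M =", M)
print("sigma^0(100) =", sigma0(n=100, gamma=0.9))
```

With $\gamma = 0.9$ the factor $(1-\gamma)^{-2}$ equals 100, so both constants grow quickly as $\gamma \to 1$, while $\sigma^0(n)$ shrinks at the rate $1/\sqrt{n}$ in the sample size.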

Figures (3)

  • Figure 1: Gridworld Environment: The green block denotes the reward, the arrows represent the policies, and the two red-hatched rectangles indicate the constrained states.
  • Figure 2: The gradient estimation error for the log barrier function and the value of the stepsize for the LB-SGD algorithm with sample sizes of 100, 300, 500, 700, 900, 1500, and 3000 at varying distances from the boundary. The lines indicate the median values obtained from 10 independent experiments, while the shaded areas represent the 10th and 90th percentiles calculated from 10 different random seeds.
  • Figure 3: The average performance of LB-SGD compared with the IPO algorithm \cite{liu2020ipo} with stepsizes $\gamma_t = 0.5, 1, 1.5$, and with RPG-PD using a regularization parameter $\tau = 0.1$ and a stepsize chosen according to \cite{ding2024last} with $b = 0.1$. The lines represent the median values from 10 independent experiments, while the shaded areas illustrate the 10th and 90th percentiles calculated from 10 different random seeds.

Theorems & Definitions (43)

  • Definition 2.2
  • Remark 3.2
  • Proposition 3.3
  • Lemma 3.4
  • Theorem 3.5
  • Proposition 4.3
  • Definition 4.5: Fisher non-degeneracy
  • Definition 4.7: Richness of Policy Parameterization
  • Lemma 4.10
  • Lemma 4.11
  • ...and 33 more