A safe exploration approach to constrained Markov decision processes

Tingting Ni; Maryam Kamgarpour

TL;DR

Unlike existing CMDP approaches that ensure policy feasibility only upon convergence, the LB-SGD algorithm guarantees feasibility throughout the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$.

Abstract

We consider discounted infinite-horizon constrained Markov decision processes (CMDPs), where the goal is to find an optimal policy that maximizes the expected cumulative reward while satisfying expected cumulative constraints. Motivated by the application of CMDPs in online learning for safety-critical systems, we focus on developing a model-free and \emph{simulator-free} algorithm that ensures \emph{constraint satisfaction during learning}. To this end, we employ the LB-SGD algorithm proposed in \cite{usmanova2022log}, which utilizes an interior-point approach based on the log-barrier function of the CMDP. Under the commonly assumed conditions of relaxed Fisher non-degeneracy and bounded transfer error in policy parameterization, we establish the theoretical properties of the LB-SGD algorithm. In particular, unlike existing CMDP approaches that ensure policy feasibility only upon convergence, the LB-SGD algorithm guarantees feasibility throughout the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. Compared to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA \cite{bai2022achieving2}, the LB-SGD algorithm requires an additional $\mathcal{O}(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with the same Fisher non-degenerate parameterization.
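To make the interior-point idea concrete, here is a minimal sketch on a hypothetical one-dimensional problem. It is not the CMDP setting and not the authors' implementation. It maximizes a log-barrier surrogate $f(x) + \eta \log g(x)$ with exact gradients and caps each step so that the constraint slack shrinks by at most half, which keeps every iterate strictly feasible. The toy objective, the constraint, the name `log_barrier_ascent`, and the stepsize rule are illustrative assumptions; LB-SGD itself works with stochastic gradient estimates and its own stepsize rule.

```python
import math


def log_barrier_ascent(eta=0.01, steps=200, x0=0.0):
    """Toy problem (assumed for illustration only):
    maximize f(x) = -(x - 2)^2  subject to  g(x) = 1 - x >= 0.
    The surrogate is B(x) = f(x) + eta * log(g(x))."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        slack = 1.0 - x                      # g(x), strictly positive inside the feasible set
        grad_f = -2.0 * (x - 2.0)            # f'(x)
        grad_g = -1.0                        # g'(x)
        grad_b = grad_f + eta * grad_g / slack  # B'(x)
        if grad_b == 0.0:
            break
        # Cap 1: |x_new - x| <= slack / 2, so the slack can at most halve.
        safe_cap = slack / (2.0 * abs(grad_b))
        # Cap 2: inverse of a local smoothness bound on B, valid while slack >= slack / 2.
        smooth_cap = 1.0 / (2.0 + 4.0 * eta / slack ** 2)
        gamma = min(safe_cap, smooth_cap)
        x = x + gamma * grad_b
        trajectory.append(x)
    return trajectory


trajectory = log_barrier_ascent()
# Stationary point of B for eta = 0.01: solve 2 * (1 + s) * s = eta for the slack s = 1 - x.
x_star = 1.0 - (-1.0 + math.sqrt(1.02)) / 2.0
print(max(trajectory) < 1.0, abs(trajectory[-1] - x_star) < 1e-6)
```

The unconstrained maximizer $x = 2$ is infeasible here, so the iterates approach the boundary $x = 1$ from the inside and settle at a slack of order $\eta$. Shrinking $\eta$ moves the limit point closer to the constrained optimum at the cost of smaller steps near the boundary.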

Paper Structure (38 sections, 21 theorems, 131 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Contributions
Notations
Problem formulation
Log barrier policy gradient approach
Estimating the log barrier gradient
Tuning the stepsize
Technical analysis of log barrier for CMDPs
Safe exploration of the algorithm
Convergence and sample complexity
Computational Experiment
Conclusion
Acknowledgments
Comparison of model-free safe RL algorithms
Discussion on Assumption \ref{emf}
...and 23 more sections

Key Result

Proposition 3.3

Let Assumption \ref{smoli} hold. The following properties hold $\forall i\in\{0,\dots,m\}$ and $\forall \theta\in\Theta$.
1. $V_i^\theta(\rho)$ is $L$-Lipschitz continuous and $M$-smooth, where $L:=\frac{M_g}{(1-\gamma)^2}$ and $M:=\frac{M_g^2+M_h}{(1-\gamma)^2}$.
2. Let $b^0(H):=\frac{\gamma^{H}}{1-\gamma}$ ...
3. Let $\sigma^0(n) := \frac{\sqrt{2}}{\sqrt{n}(1-\gamma)}$ and $\sigma^1(n) := \frac{2\sqrt{2}M_g}{\dots}$ ...
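The constants in Proposition 3.3 that survive in full can be evaluated directly. The sketch below computes $L$, $M$, and $\sigma^0(n)$ from the formulas as stated. The function names and the example values of $\gamma$, $M_g$, $M_h$, and $n$ are placeholders chosen for illustration; $M_g$ and $M_h$ are assumed here to be the bounds supplied by the referenced assumption.

```python
import math


def value_function_constants(gamma, M_g, M_h):
    """Lipschitz constant L and smoothness constant M from Proposition 3.3:
    L = M_g / (1 - gamma)^2,  M = (M_g^2 + M_h) / (1 - gamma)^2."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("discount factor gamma must lie in [0, 1)")
    L = M_g / (1.0 - gamma) ** 2
    M = (M_g ** 2 + M_h) / (1.0 - gamma) ** 2
    return L, M


def sigma0(n, gamma):
    """sigma^0(n) = sqrt(2) / (sqrt(n) * (1 - gamma)), with n the sample size."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return math.sqrt(2.0) / (math.sqrt(n) * (1.0 - gamma))


# Placeholder values, not taken from the paper.
L, M = value_function_constants(gamma=0.9, M_g=1.0, M_h=1.0)
print(L, M)                # approximately 100 and 200
print(sigma0(100, 0.9))    # approximately sqrt(2)
```

Both constants grow like $(1-\gamma)^{-2}$, so a discount factor close to 1 inflates them quickly, while $\sigma^0(n)$ decays at the usual $n^{-1/2}$ rate in the sample size.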

Figures (3)

Figure 1: Gridworld Environment: The green block denotes the reward, the arrows represent the policies, and the two red-hatched rectangles indicate the constrained states.
Figure 2: Gradient estimation error for the log barrier function and stepsize values for the LB-SGD algorithm with sample sizes of 100, 300, 500, 700, 900, 1500, and 3000 at varying distances from the boundary. Lines show the median over 10 independent experiments, each with a different random seed; shaded areas show the 10th and 90th percentiles.
Figure 3: Average performance comparison between the IPO algorithm \cite{liu2020ipo} with stepsizes $\gamma_t = 0.5, 1, 1.5$, RPG-PD with regularization parameter $\tau = 0.1$ and stepsize chosen according to \cite{ding2024last} with $b = 0.1$, and LB-SGD. Lines show the median over 10 independent experiments, each with a different random seed; shaded areas show the 10th and 90th percentiles.
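The median lines and percentile bands described in the captions for Figures 2 and 3 can be computed per time step across seeds. The sketch below shows one way to do this with the Python standard library. The curves are synthetic placeholders generated only to exercise the function, not data from the paper, and `percentile_band` is a hypothetical helper name.

```python
import random
import statistics


def percentile_band(runs):
    """runs: list of equal-length curves, one per random seed.
    Returns per-step lists (p10, median, p90) across the runs."""
    p10, med, p90 = [], [], []
    for values in zip(*runs):  # all seeds at one time step
        # Nine decile cut points; "inclusive" interpolates linearly within the sample range.
        deciles = statistics.quantiles(values, n=10, method="inclusive")
        p10.append(deciles[0])
        med.append(statistics.median(values))
        p90.append(deciles[-1])
    return p10, med, p90


# Synthetic placeholder curves: 10 seeds, 50 steps each (not experimental data).
runs = []
for seed in range(10):
    rng = random.Random(seed)
    runs.append([1.0 / (t + 1) + rng.gauss(0.0, 0.05) for t in range(50)])

p10, med, p90 = percentile_band(runs)
print(len(med), p10[0] <= med[0] <= p90[0])
```

With only 10 runs per step, the 10th and 90th percentiles fall between the two smallest and the two largest values respectively, so the band is sensitive to individual seeds.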

Theorems & Definitions (43)

Definition 2.2
Remark 3.2
Proposition 3.3
Lemma 3.4
Theorem 3.5
Proposition 4.3
Definition 4.5: Fisher non-degeneracy
Definition 4.7: Richness of Policy Parameterization
Lemma 4.10
Lemma 4.11
...and 33 more
