A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Toshinori Kitamura; Tadashi Kozuno; Masahiro Kato; Yuki Ichihara; Soichiro Nishimori; Akiyoshi Sannai; Sho Sonoda; Wataru Kumagai; Yutaka Matsuo

TL;DR

The paper introduces a novel policy gradient primal-dual (PD) algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy.

Abstract

We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.
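
To make the primal-dual idea in the abstract concrete, here is a minimal sketch of a generic, unregularized primal-dual update on a hypothetical single-state CMDP (a constrained bandit). It is not the paper's algorithm: the function names, step sizes, and the toy reward/utility setup are all assumptions made for illustration, and it omits the regularization and exploration bonuses the paper relies on. The policy is a softmax over logits, the primal step is exact gradient ascent on the Lagrangian, and the dual step is projected gradient descent on the multiplier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def primal_dual_step(logits, lam, reward, utility, threshold,
                     lr_policy=0.5, lr_dual=0.5):
    """One illustrative primal-dual step (hypothetical helper, not from the paper).

    Lagrangian: L(pi, lam) = pi . reward + lam * (pi . utility - threshold),
    maximized over the policy pi and minimized over lam >= 0.
    """
    pi = softmax(logits)

    # Per-action Lagrangian payoff: r(a) + lam * u(a).
    payoff = [r + lam * u for r, u in zip(reward, utility)]

    # Exact softmax policy gradient:
    # d/d logit_a of sum_b pi_b * q_b equals pi_a * (q_a - sum_b pi_b * q_b).
    baseline = sum(p * q for p, q in zip(pi, payoff))
    new_logits = [x + lr_policy * p * (q - baseline)
                  for x, p, q in zip(logits, pi, payoff)]

    # Dual step: dL/dlam = pi . utility - threshold; descend and project onto lam >= 0.
    slack = sum(p * u for p, u in zip(pi, utility)) - threshold
    new_lam = max(0.0, lam - lr_dual * slack)

    return new_logits, new_lam
```

Iterating `primal_dual_step` from, say, `logits = [2.0, 0.0]` and `lam = 0.0` with `reward = [1.0, 0.0]`, `utility = [0.0, 1.0]`, and `threshold = 0.5` lets one observe how the multiplier grows while the constraint is violated and is clipped at zero otherwise.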

Paper Structure

This paper contains 39 sections, 28 theorems, 143 equations, 1 figure, 1 table, and 3 algorithms.

Introduction
Preliminary
Constrained Markov Decision Processes.
Policy and Regularized Value Functions.
Learning Problem Setup
Performance Measure
The UOpt-RPGPD Algorithm
Regularized Lagrange function.
Uniform-PAC Exploration Bonus.
Adjust Regularization Coefficients and Learning Rate.
Uniform-PAC Analysis
Experiments
Conclusion
Limitation and Future Work.
Related Work
...and 24 more sections

Key Result

Theorem 2.3

Suppose an algorithm is Uniform-PAC for $\delta$ with $F_{\mathrm{UPAC}}(\cdots)=\widetilde{\mathcal{O}}\left(C\varepsilon^{-\alpha}\right)$, where $C, \alpha > 0$ are constants independent of $\varepsilon$. Then, the algorithm

Figures (1)

Figure 1: Comparison of the algorithms described in the Experiments section. Left: optimality gap ($\Delta_{\mathrm{opt}}^k$). Right: constraint violation ($\Delta_{\mathrm{vio}}^k$).

Theorems & Definitions (51)

Definition 2.2: Uniform-PAC
Theorem 2.3
Lemma 3.1
Theorem 4.1
Corollary 4.2
Definition B.1: Regret
Definition B.2: $(\varepsilon, \delta)$-PAC
Remark B.3: Weak Regret Measures
Lemma F.1: Lemma 34 in efroni2020exploration
Lemma F.2: Error to regret
...and 41 more
