Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

This work addresses learning in Constrained Markov Decision Processes (CMDPs) with general parameterized policies and aims to obtain last-iterate convergence guarantees rather than guarantees that hold only on average over the iterates. It introduces the Primal-Dual Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm, which combines entropy regularization and a quadratic dual term within a regularized Lagrangian with a variance-reduced gradient estimator and an ASGD-based inner loop. PDR-ANPG achieves a last-iterate optimality gap of $O(\epsilon+\epsilon_{\mathrm{bias}}^{1/6})$ and a constraint violation of the same order, with a sample complexity of $\tilde{O}(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$. The results show explicitly how the policy expressivity error $\epsilon_{\mathrm{bias}}$ governs the attainable accuracy and sample cost: complete policy classes ($\epsilon_{\mathrm{bias}}=0$) yield an $O(\epsilon)$ gap with $\tilde{O}(\epsilon^{-4})$ samples, while incomplete classes need only $\tilde{O}(\epsilon^{-2})$ samples when $\epsilon$ is sufficiently small. The analysis connects the NPG estimator bias to ASGD convergence and uses a novel sampling scheme to bound average advantages under entropy regularization. Overall, the paper substantially improves on prior last-iterate guarantees for general parameterized CMDPs and clarifies the role of policy expressivity in achieving tight, safety-conscious convergence guarantees.
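To make the phrase "entropy regularization and a quadratic dual term within a regularized Lagrangian" concrete, the block below sketches one common form of such an objective from the regularized primal-dual literature. The notation is an assumption and is not taken from this summary: $J_r^{\pi}$ and $J_c^{\pi}$ denote the reward and constraint values of policy $\pi$, $\mathcal{H}(\pi)$ a discounted entropy term, and $\tau>0$ a regularization weight. The paper's exact definition may differ.

```latex
% Hypothetical notation; a sketch, not the paper's stated definition.
\begin{align*}
  \mathcal{L}_{\tau}(\pi, \lambda)
    &= J_r^{\pi} + \lambda J_c^{\pi}
       + \tau \left( \mathcal{H}(\pi) + \tfrac{1}{2}\lambda^{2} \right), \\
  (\pi_{\tau}^{*}, \lambda_{\tau}^{*})
    &\in \arg\max_{\pi} \, \arg\min_{\lambda \geq 0} \,
       \mathcal{L}_{\tau}(\pi, \lambda).
\end{align*}
```

In a form like this, the entropy term regularizes the maximization over $\pi$ and the quadratic term makes the objective strongly convex in $\lambda$, which is the usual reason such regularizers help the final iterate, and not just the average of the iterates, converge.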

Abstract

We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to an additive factor depending on $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
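The two regimes of the sample-complexity bound can be checked numerically. The following is a minimal sketch that evaluates only the order terms stated above, dropping constants and logarithmic factors. The function names are hypothetical and chosen for illustration; they are not part of the paper.

```python
import math


def sample_complexity_order(eps: float, eps_bias: float) -> float:
    """Order term eps^-2 * min(eps^-2, eps_bias^(-1/3)).

    Constants and log factors hidden by the O-tilde notation are ignored.
    For eps_bias == 0 (complete policy class) the min is eps^-2.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if eps_bias < 0:
        raise ValueError("eps_bias must be non-negative")
    inner = eps ** -2
    if eps_bias > 0:
        inner = min(inner, eps_bias ** (-1.0 / 3.0))
    return (eps ** -2) * inner


def gap_order(eps: float, eps_bias: float) -> float:
    """Order term eps + eps_bias^(1/6) for the last-iterate optimality gap."""
    return eps + eps_bias ** (1.0 / 6.0)


def in_reduced_regime(eps: float, eps_bias: float) -> bool:
    """True when eps < eps_bias^(1/6), i.e. the bound scales as eps^-2."""
    return eps_bias > 0 and eps < eps_bias ** (1.0 / 6.0)


if __name__ == "__main__":
    for eps, eps_bias in [(0.1, 0.0), (0.1, 1e-3), (0.5, 1e-3)]:
        print(
            f"eps={eps}, eps_bias={eps_bias}: "
            f"samples ~ {sample_complexity_order(eps, eps_bias):.4g}, "
            f"gap ~ {gap_order(eps, eps_bias):.4g}, "
            f"reduced regime: {in_reduced_regime(eps, eps_bias)}"
        )
```

The threshold follows from the bound itself: $\epsilon<\epsilon_{\mathrm{bias}}^{1/6}$ is equivalent to $\epsilon^{-2}>\epsilon_{\mathrm{bias}}^{-1/3}$, so the minimum selects $\epsilon_{\mathrm{bias}}^{-1/3}$, which does not depend on $\epsilon$.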

Paper Structure (20 sections, 12 theorems, 99 equations, 1 table, 1 algorithm)

Introduction
Contribution and Challenges
Related Works
Formulation
Algorithm Design
Last-Iterate Convergence Analysis
Analysis of the Outer Loop
Analysis of the Inner Loop
Optimality Gap and Constraint Violation
Conclusion
Proof of Lemma \ref{lemma:grad_compute}
Proof of Lemma \ref{lemma:advantage_bound}
Proof of Lemma \ref{lemma_unbiased}
Proof of Lemma \ref{lemma_recursion_phi_k}
Proof of Lemma \ref{lemma_npg_variance}
...and 5 more sections

Key Result

Lemma 1

An optimal primal-dual pair $(\pi^*, \lambda^*)$ is guaranteed to exist if the Slater condition (Assumption \ref{ass_slater}) holds. Moreover, the pair satisfies strong duality. Additionally, $0\leq \lambda^*\leq 1/[(1-\gamma)c_{\mathrm{slat}}]$.
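The strong duality condition itself is not reproduced in this summary. The block below sketches the standard form such a condition takes for CMDPs, together with the usual argument behind the bound on $\lambda^*$. It rests on assumptions that are not stated here: the constraint is written as $J_c^{\pi}\geq 0$, rewards lie in $[0,1]$, $J_{r+\lambda c}^{\pi}=J_r^{\pi}+\lambda J_c^{\pi}$, and $\bar{\pi}$ is a hypothetical Slater policy with $J_c^{\bar{\pi}}\geq c_{\mathrm{slat}}>0$.

```latex
% Assumed notation; a sketch of the standard argument, not the paper's text.
\begin{align*}
  J_r^{\pi^*}
    &= \max_{\pi} \min_{\lambda \geq 0} J_{r+\lambda c}^{\pi}
     = \min_{\lambda \geq 0} \max_{\pi} J_{r+\lambda c}^{\pi}
     = \max_{\pi} J_{r+\lambda^* c}^{\pi}, \\
  J_r^{\pi^*}
    &\geq J_r^{\bar{\pi}} + \lambda^* J_c^{\bar{\pi}}
     \geq J_r^{\bar{\pi}} + \lambda^* c_{\mathrm{slat}}, \\
  \lambda^*
    &\leq \frac{J_r^{\pi^*} - J_r^{\bar{\pi}}}{c_{\mathrm{slat}}}
     \leq \frac{1}{(1-\gamma)\, c_{\mathrm{slat}}}.
\end{align*}
```

The last inequality uses $0\leq J_r^{\pi}\leq 1/(1-\gamma)$, which holds under the assumed reward range.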

Theorems & Definitions (13)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
proof
Lemma 6
Lemma 7
Lemma 8
Corollary 1
...and 3 more
