Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

This work addresses learning in Constrained Markov Decision Processes (CMDPs) with general parameterized policies and aims to obtain last-iterate convergence guarantees rather than average-iterate guarantees. It introduces the Primal-Dual Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm, which uses entropy regularization and a quadratic dual term within a regularized Lagrangian, together with a variance-reduced gradient estimator and an ASGD-based inner loop. PDR-ANPG achieves a last-iterate optimality gap of $O(\epsilon+\epsilon_{\mathrm{bias}}^{1/6})$ and a constraint violation of the same order, with a sample complexity of $\tilde{O}(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$. The results show explicitly how the policy expressivity error $\epsilon_{\mathrm{bias}}$ governs the attainable accuracy and sample cost: complete policy classes ($\epsilon_{\mathrm{bias}}=0$) yield an $O(\epsilon)$ gap with $\tilde{O}(\epsilon^{-4})$ samples, while incomplete classes ($\epsilon_{\mathrm{bias}}>0$) need only $\tilde{O}(\epsilon^{-2})$ samples once $\epsilon<\epsilon_{\mathrm{bias}}^{1/6}$. The analysis connects the NPG estimator bias to ASGD convergence and leverages a novel sampling scheme to bound average advantages under entropy regularization. Overall, the paper substantially improves on prior last-iterate guarantees for general parameterized CMDPs and clarifies the role of policy expressivity in achieving tight, safety-conscious convergence guarantees.

Abstract

We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
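
To make the two regimes of the stated sample complexity concrete, the following is a minimal Python sketch that evaluates only the leading-order term $\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-1/3}\}$. The function names are hypothetical, and the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}(\cdot)$ are ignored, so the returned values indicate scaling only, not actual sample counts.

```python
import math


def sample_complexity_order(eps: float, eps_bias: float) -> float:
    """Leading-order term eps^-2 * min(eps^-2, eps_bias^(-1/3)).

    Constants and log factors hidden by the O-tilde notation are ignored
    (an illustrative simplification, not part of the paper's statement).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if eps_bias < 0:
        raise ValueError("eps_bias must be non-negative")
    if eps_bias == 0:
        # Complete policy class: the min is attained by eps^-2.
        return eps ** -4
    return eps ** -2 * min(eps ** -2, eps_bias ** (-1.0 / 3.0))


def is_bias_limited(eps: float, eps_bias: float) -> bool:
    """True when eps < eps_bias^(1/6), the regime where the order is eps^-2."""
    return eps_bias > 0 and eps < eps_bias ** (1.0 / 6.0)


if __name__ == "__main__":
    for eps, eps_bias in [(0.1, 0.0), (0.5, 1e-6), (0.01, 1e-6)]:
        order = sample_complexity_order(eps, eps_bias)
        print(eps, eps_bias, is_bias_limited(eps, eps_bias), order)
```

In this sketch, a target accuracy below $\epsilon_{\mathrm{bias}}^{1/6}$ makes the $\min$ term independent of $\epsilon$, which is why the order drops from $\epsilon^{-4}$ to $\epsilon^{-2}$ for incomplete classes.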

Paper Structure (20 sections, 12 theorems, 99 equations, 1 table, 1 algorithm)

Key Result

Lemma 1

An optimal primal-dual pair $(\pi^*, \lambda^*)$ is guaranteed to exist if the Slater condition assumption (ass_slater) holds. Moreover, this pair satisfies a strong duality condition. Additionally, $0\leq \lambda^*\leq 1/[(1-\gamma)c_{\mathrm{slat}}]$.
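
As a small worked illustration of the bound on $\lambda^*$, the sketch below computes $1/[(1-\gamma)c_{\mathrm{slat}}]$ and clips a candidate dual variable to the interval $[0, 1/((1-\gamma)c_{\mathrm{slat}})]$. It assumes that $\gamma \in [0, 1)$ is the discount factor and that $c_{\mathrm{slat}} > 0$ is the Slater constant; the function names are hypothetical, and the clipping step is a common primal-dual device shown here for illustration rather than a confirmed part of PDR-ANPG.

```python
import math


def dual_upper_bound(gamma: float, c_slat: float) -> float:
    """Upper bound 1 / ((1 - gamma) * c_slat) on the optimal dual variable.

    Assumes gamma is a discount factor in [0, 1) and c_slat > 0.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if c_slat <= 0.0:
        raise ValueError("c_slat must be positive")
    return 1.0 / ((1.0 - gamma) * c_slat)


def clip_dual(lam: float, gamma: float, c_slat: float) -> float:
    """Clip lam to [0, dual_upper_bound(gamma, c_slat)] (illustrative step)."""
    return min(max(lam, 0.0), dual_upper_bound(gamma, c_slat))


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(dual_upper_bound(0.9, 0.5))
    print(clip_dual(35.0, 0.9, 0.5))
    print(clip_dual(-1.0, 0.9, 0.5))
```

The bound grows as $\gamma \to 1$ or $c_{\mathrm{slat}} \to 0$, so a longer effective horizon or a weaker Slater margin permits a larger optimal dual variable.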

Theorems & Definitions (13)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • proof
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Corollary 1
  • ...and 3 more