Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs
Washim Uddin Mondal, Vaneet Aggarwal
TL;DR
This work addresses learning in Constrained Markov Decision Processes (CMDPs) with general parameterized policies and aims to obtain last-iterate convergence guarantees rather than average-iterate guarantees. It introduces the Primal-Dual Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm, which combines entropy regularization and a quadratic dual term in a regularized Lagrangian with a variance-reduced gradient estimator and an inner loop based on accelerated stochastic gradient descent (ASGD). PDR-ANPG achieves a last-iterate optimality gap of $O(\epsilon+\epsilon_{\mathrm{bias}}^{1/6})$ and a constraint violation of the same order, with a sample complexity of $\tilde{O}(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$. The results show explicitly how the policy expressivity error $\epsilon_{\mathrm{bias}}$ governs the attainable accuracy and sample cost: complete policy classes ($\epsilon_{\mathrm{bias}}=0$) yield an $O(\epsilon)$ gap with $\tilde{O}(\epsilon^{-4})$ samples, while incomplete classes ($\epsilon_{\mathrm{bias}}>0$) need only $\tilde{O}(\epsilon^{-2})$ samples when $\epsilon<\epsilon_{\mathrm{bias}}^{1/6}$. The analysis connects the bias of the natural policy gradient (NPG) estimator to ASGD convergence and leverages a novel sampling scheme to bound average advantages under entropy regularization. Overall, the paper substantially improves on prior last-iterate guarantees for general parameterized CMDPs and clarifies the role of policy expressivity in those guarantees.
Abstract
We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to an additive term depending on $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
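
The two regimes quoted in the abstract can be read off by splitting the minimum in the sample complexity bound. The following is a short worked case analysis, not taken from the paper's text; it assumes that $\epsilon_{\mathrm{bias}}$ is treated as a fixed constant in the second case and uses the convention $\epsilon_{\mathrm{bias}}^{-1/3}=+\infty$ when $\epsilon_{\mathrm{bias}}=0$. Since $\epsilon^{-2}\le\epsilon_{\mathrm{bias}}^{-1/3}$ holds exactly when $\epsilon\ge\epsilon_{\mathrm{bias}}^{1/6}$,

$$
\epsilon^{-2}\min\left\{\epsilon^{-2},\,\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\right\}
=
\begin{cases}
\epsilon^{-4}, & \epsilon\ge\epsilon_{\mathrm{bias}}^{\frac{1}{6}} \quad (\text{in particular whenever } \epsilon_{\mathrm{bias}}=0),\\[4pt]
\epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}, & \epsilon<\epsilon_{\mathrm{bias}}^{\frac{1}{6}}.
\end{cases}
$$

In the second case the factor $\epsilon_{\mathrm{bias}}^{-1/3}$ does not depend on $\epsilon$, which is why the bound is stated as $\tilde{\mathcal{O}}(\epsilon^{-2})$ for incomplete classes.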
