Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Alejandro Ribeiro

TL;DR

This work develops two single-time-scale, policy-gradient primal-dual methods for constrained MDPs, reformulating the CMDP via a Lagrangian $L(\pi,\lambda)=V_r^{\pi}(\rho)+\lambda V_g^{\pi}(\rho)$ and proving non-asymptotic, last-iterate convergence. RPG-PD adds entropy regularization and quadratic dual regularization to achieve sublinear last-iterate convergence (nearly dimension-free), while OPG-PD uses optimistic gradient updates to obtain a problem-dependent linear rate. Extensions to function approximation (e.g., linear/log-linear policies) show that convergence degrades gracefully to a neighborhood whose size depends on approximation error, with near-optimal policies achievable under small errors. The experimental results corroborate the theory, demonstrating reduced oscillations and robust last-iterate constraint satisfaction, and outperforming baseline primal/dual methods in stability and speed. Overall, the paper advances single-loop, last-iterate guarantees for constrained MDPs and provides practical, scalable algorithms for safe policy optimization.
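
The RPG-PD idea described above can be illustrated on a small tabular problem. The following is a minimal sketch, not the authors' implementation: it assumes the policy step takes a multiplicative-weights form driven by the entropy-regularized Q-function of $r + \lambda g$, and that the dual step is a projected gradient step on $\lambda$ with a quadratic regularizer, both computed from the same current iterate. The helper names `evaluate` and `rpg_pd`, the dual cap `lam_max`, the probability floor, and the random CMDP are all hypothetical choices made for illustration; the stepsize and regularization values are borrowed from the figure captions.

```python
import numpy as np

def evaluate(P, c, pi, gamma):
    """Exact Q and V of policy pi for a per-step signal c of shape (S, A)."""
    S, _ = c.shape
    P_pi = np.einsum("sa,sap->sp", pi, P)       # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)                 # expected one-step signal under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + gamma * P @ V                       # (S, A, S) @ (S,) -> (S, A)
    return Q, V

def rpg_pd(P, r, g, rho, gamma=0.9, eta=0.1, tau=0.08, lam_max=10.0, T=500):
    """Sketch of a regularized primal-dual loop (assumed update forms)."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)               # start from the uniform policy
    lam = 0.0
    for _ in range(T):
        psi = -np.log(pi)                       # entropy bonus, psi = -log pi
        Q, _ = evaluate(P, r + lam * g + tau * psi, pi, gamma)
        _, V_g = evaluate(P, g, pi, gamma)

        # Primal step: multiplicative weights, pi_new ~ pi * exp(eta * Q).
        logits = np.log(pi) + eta * Q
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        new_pi = np.exp(logits)
        new_pi /= new_pi.sum(axis=1, keepdims=True)
        # Assumed safeguard: keep probabilities away from zero so log stays finite.
        new_pi = np.clip(new_pi, 1e-8, None)
        new_pi /= new_pi.sum(axis=1, keepdims=True)

        # Dual step: the dual variable is the min player, so it descends on
        # V_g(rho) + tau * lam, then is projected onto [0, lam_max].
        lam = float(np.clip(lam - eta * (rho @ V_g + tau * lam), 0.0, lam_max))

        pi = new_pi                             # simultaneous (single-loop) update
    return pi, lam

# Hypothetical random CMDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel, shape (S, A, S)
r = rng.uniform(0.0, 1.0, size=(S, A))          # reward
g = rng.uniform(-1.0, 1.0, size=(S, A))         # utility, constraint V_g(rho) >= 0
rho = np.full(S, 1.0 / S)                       # uniform initial distribution

pi, lam = rpg_pd(P, r, g, rho)
```

Both players move once per iteration using quantities from the same iterate, which is what single-time-scale means here. A two-time-scale method would instead solve the inner policy problem to near-optimality before each dual step.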

Abstract

We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods used in practice, the oscillation of policy iterates in these methods has not been fully understood, bringing out issues such as violation of constraints and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. Specifically, we first propose a regularized policy gradient primal-dual (RPG-PD) method that updates the policy using an entropy-regularized policy gradient, and the dual variable via a quadratic-regularized gradient ascent, simultaneously. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point with a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update primal/dual variables, simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge to a saddle point that contains an optimal constrained policy, with a linear rate. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
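
The oscillation issue and the effect of an optimistic update can be seen on a toy saddle-point problem. The sketch below is not OPG-PD and involves no MDP: it assumes the bilinear objective $f(x, y) = xy$, where $x$ minimizes and $y$ maximizes, and compares plain simultaneous gradient descent-ascent with a standard unconstrained optimistic gradient step. The names `field`, `gda`, and `ogda`, the stepsize, and the starting point are illustrative choices.

```python
import numpy as np

def field(z):
    """Update direction for f(x, y) = x * y: (df/dx, -df/dy) = (y, -x)."""
    x, y = z
    return np.array([y, -x])

def gda(z0, eta, T):
    """Simultaneous gradient descent-ascent; returns the last iterate."""
    z = np.array(z0, dtype=float)
    for _ in range(T):
        z = z - eta * field(z)
    return z

def ogda(z0, eta, T):
    """Optimistic gradient: reuse the previous direction as a correction."""
    z_prev = np.array(z0, dtype=float)
    z = z_prev.copy()
    for _ in range(T):
        z_next = z - 2.0 * eta * field(z) + eta * field(z_prev)
        z_prev, z = z, z_next
    return z

z0 = (1.0, 1.0)                      # the unique saddle point is (0, 0)
z_gda = gda(z0, eta=0.3, T=200)
z_ogda = ogda(z0, eta=0.3, T=200)

print("GDA  distance to saddle:", np.linalg.norm(z_gda))
print("OGDA distance to saddle:", np.linalg.norm(z_ogda))
```

On this toy problem the last iterate of plain descent-ascent spirals away from the saddle point, while the optimistic iterate contracts toward it. This is the last-iterate behavior that the abstract contrasts with oscillating Lagrangian methods.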

Paper Structure

This paper contains 40 sections, 29 theorems, 190 equations, 17 figures, 1 table, and 3 algorithms.

Key Result

Lemma 1

Let the feasibility assumption hold. Then (i) strong duality holds for the constrained MDP problem, i.e., $V_P^{\pi^\star}(\rho) = V_D^{\lambda^\star}(\rho)$; (ii) optimal dual variables are bounded, i.e., $\lambda^\star \in [\, 0, (V_r^{\pi^\star}(\rho) - V_r^{\bar{\pi}}(\rho))/\xi \,]$.
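
The bound in (ii) follows from a short standard argument, sketched here under assumptions that this summary does not spell out: the feasibility assumption is read as a Slater-type condition supplying a policy $\bar{\pi}$ with $V_g^{\bar{\pi}}(\rho) \geq \xi > 0$, the dual function is taken to be $V_D^{\lambda}(\rho) = \max_{\pi} \{ V_r^{\pi}(\rho) + \lambda V_g^{\pi}(\rho) \}$, and the primal value is $V_P^{\pi^\star}(\rho) = V_r^{\pi^\star}(\rho)$.

```latex
\begin{align*}
V_r^{\pi^\star}(\rho)
  &= V_D^{\lambda^\star}(\rho)
     && \text{(strong duality, part (i))} \\
  &= \max_{\pi} \big\{ V_r^{\pi}(\rho) + \lambda^\star V_g^{\pi}(\rho) \big\} \\
  &\geq V_r^{\bar{\pi}}(\rho) + \lambda^\star V_g^{\bar{\pi}}(\rho)
     && \text{(evaluate at } \pi = \bar{\pi} \text{)} \\
  &\geq V_r^{\bar{\pi}}(\rho) + \lambda^\star \xi
     && \text{(since } \lambda^\star \geq 0 \text{ and } V_g^{\bar{\pi}}(\rho) \geq \xi \text{)}
\end{align*}
```

Rearranging gives $\lambda^\star \leq (V_r^{\pi^\star}(\rho) - V_r^{\bar{\pi}}(\rho))/\xi$, which together with $\lambda^\star \geq 0$ is the stated interval.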

Figures (17)

  • Figure 1: Convergence performance of RPG-PD, OPG-PD, and primal-dual methods. Learning curves of our RPG-PD (blue dashed) and OPG-PD (red solid) methods, and of the NPG-PD [ding2020natural] (cyan dash-dotted) and PID Lagrangian [stooke2020responsive] (dotted) methods. The horizontal axes show the iterations of the policy iterates $\{\pi_t\}_{t \geq 0}$ generated by each method, and the vertical axes show the value functions of the policy iterates: reward value $V_r^{\pi_t}(\rho)$ (left) and utility value $V_g^{\pi_t}(\rho)$ (right). In this experiment, we use the same stepsize $\eta = 0.1$ for all methods, the regularization parameter $\tau = 0.08$ for RPG-PD, and the uniform initial distribution $\rho$.
  • Figure 2: Convergence performance of OPG-PD with stepsize $\eta$: $\eta = 0.05$ (dotted), $\eta = 0.1$ (blue dash-dotted), and $\eta = 0.2$ (red solid). The horizontal axis shows the iterations of the policy iterates $\{\pi_t\}_{t \geq 0}$ generated by OPG-PD, and the vertical axis shows the policy optimality gap, which measures the distance of the policy iterates to an optimal policy $\pi^\star$: $\sum_s \left\|\pi_t(\cdot \,\vert\, s) - \pi^\star(\cdot \,\vert\, s)\right\|^2$. In this experiment, the initial distribution $\rho$ is uniform.
  • Figure 3: An example of a constrained MDP with objective function $V_r^{\pi}(\rho)$ and constraint set $\{\pi \in \Pi \,\vert\, V_g^{\pi}(\rho) \geq 0\}$. The triple $(a, r, g)$ attached to a directed arrow gives the (reward, utility) pair $(r, g)$ received when action $a$ is taken at the corresponding state.
  • Figure 4: Convergence performance of RPG-PD, OPG-PD, and dual methods. Learning curves of our RPG-PD (blue dashed) and OPG-PD (red solid) methods, and of the PMD-PD [liu2021policy] (cyan dashed), AR-CPO [li2021faster] (magenta dash-dotted), and Accelerated Dual [ying2022dual] (dotted) methods. The horizontal axes show the iterations of the policy iterates $\{\pi_t\}_{t \geq 0}$ generated by each method, and the vertical axes show the value functions of the policy iterates: reward value $V_r^{\pi_t}(\rho)$ (left) and utility value $V_g^{\pi_t}(\rho)$ (right). In this experiment, RPG-PD and OPG-PD use the stepsize $\eta = 0.1$, RPG-PD uses the regularization parameter $\tau = 0.08$, and the initial distribution $\rho$ is uniform. PMD-PD, AR-CPO, and Accelerated Dual use the stepsize $\eta = 0.1$ for the dual update, the regularized NPG stepsize $\alpha = 1$, the regularization parameter $\tau = 0.08$, and the uniform initial distribution $\rho$.
  • Figure 5: Convergence performance of RPG-PD, OPG-PD, and primal methods. Learning curves of our RPG-PD (blue dashed) and OPG-PD (red solid) methods, and of the CRPO [xu2021crpo] (black dash-dotted) method. The horizontal axes show the iterations of the policy iterates $\{\pi_t\}_{t \geq 0}$ generated by each method, and the vertical axes show the value functions of the policy iterates: reward value $V_r^{\pi_t}(\rho)$ (left) and utility value $V_g^{\pi_t}(\rho)$ (right). In this experiment, RPG-PD and OPG-PD use the stepsize $\eta = 0.1$, RPG-PD uses the regularization parameter $\tau = 0.08$, and the initial distribution $\rho$ is uniform. For CRPO, we update the policy via the NPG step with stepsize $\eta = 0.1$ and the uniform initial distribution $\rho$.
  • ...and 12 more figures
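
The policy optimality gap used in the Figure 2 caption, $\sum_s \|\pi_t(\cdot \,\vert\, s) - \pi^\star(\cdot \,\vert\, s)\|^2$, is simple to compute for tabular policies. The sketch below assumes the norm is Euclidean and that policies are stored as arrays of shape (states, actions); the function name `policy_gap` and the two example policies are hypothetical.

```python
import numpy as np

def policy_gap(pi_t, pi_star):
    """Sum over states of the squared Euclidean distance between the
    action distributions pi_t(.|s) and pi_star(.|s)."""
    pi_t = np.asarray(pi_t, dtype=float)
    pi_star = np.asarray(pi_star, dtype=float)
    return float(np.sum((pi_t - pi_star) ** 2))

# Two states, two actions: a uniform policy against a deterministic one.
pi_uniform = [[0.5, 0.5], [0.5, 0.5]]
pi_deterministic = [[1.0, 0.0], [0.0, 1.0]]

gap = policy_gap(pi_uniform, pi_deterministic)   # 0.5 per state, 1.0 in total
```

A gap of zero means the two policies agree at every state. The value of the metric depends on the choice of $\pi^\star$ when optimal policies are not unique.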

Theorems & Definitions (55)

  • Lemma 1: Strong duality/Saddle point existence and boundedness
  • Theorem 2: Linear convergence of RPG-PD
  • Corollary 3: Nearly-optimal constrained policy
  • Theorem 4: Linear convergence of inexact RPG-PD
  • Corollary 5: Nearly-optimal constrained policy
  • Theorem 6: Linear convergence of OPG-PD
  • Corollary 7: Nearly-optimal constrained policy
  • Lemma 8: Invariance of saddle points
  • proof
  • Lemma 9: Interchangeability of saddle points
  • ...and 45 more