Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction
Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen
TL;DR
This work introduces VR-PDPG, a variance-reduced primal-dual policy gradient method for Concave CMDPs, where both the objective and the constraints are concave in the occupancy measure. By leveraging a hidden concavity through local invertibility of the occupancy-to-parameter map, together with strong duality under Slater's condition, the authors derive a primal-dual policy gradient (PDPG) framework and its variance-reduced variant for the sample-based setting. They establish global convergence in both the general and the strongly concave case, with rates of $O(T^{-1/3})$ (general) and $O(T^{-1/2})$ (strongly concave) in the exact setting, and a sample complexity of $\widetilde{O}(\epsilon^{-4})$ for $\epsilon$-global optimality in the stochastic setting. They also show that a diminishing pessimism term yields zero constraint violation without sacrificing the convergence rate of the optimality gap. Numerical experiments on gridworlds corroborate improved performance and sample efficiency relative to baselines. The work advances safe RL by extending policy-based primal-dual methods to general concave utilities and multiple safety constraints.
Abstract
We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we demonstrate that VR-PDPG achieves an $\widetilde{O}(\epsilon^{-4})$ sample complexity for $\epsilon$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
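To make the primal-dual structure described above concrete, the following is a minimal sketch on a toy problem. It is not the authors' VR-PDPG: it uses exact gradients (no sampling, no variance reduction, no pessimism term) on a single-state problem where the occupancy measure coincides with a softmax policy. The objective (entropy), the constraint form $g(d) \ge 0$, the step size, the dual bound `mu_max`, and the function name `run_pdpg_toy` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax; maps parameters to a distribution."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def run_pdpg_toy(steps=3000, eta=0.5, mu_max=10.0):
    """Exact primal-dual gradient iteration on a hypothetical toy problem.

    Single state, three actions, so the occupancy measure d equals the
    policy softmax(theta). Assumed problem (not from the paper):
        maximize   f(d) = -sum_i d_i log d_i      (concave in d)
        subject to g(d) = d_0 - 0.5 >= 0          (linear, hence concave)
    Lagrangian: L(theta, mu) = f(d) + mu * g(d), with mu >= 0.
    """
    u = np.array([1.0, 0.0, 0.0])  # g(d) = u @ d - b
    b = 0.5
    theta = np.zeros(3)
    mu = 0.0
    for _ in range(steps):
        d = softmax(theta)
        # Gradient of the Lagrangian w.r.t. the occupancy measure acts
        # as a "pseudo-reward" for the policy gradient step.
        grad_f = -(np.log(d) + 1.0)
        pseudo_reward = grad_f + mu * u
        # Chain rule through softmax: J^T r = d * (r - d @ r).
        grad_theta = d * (pseudo_reward - d @ pseudo_reward)
        g = d @ u - b
        # Primal: gradient ascent on theta.
        theta = theta + eta * grad_theta
        # Dual: projected (sub)gradient descent on mu over [0, mu_max].
        mu = float(np.clip(mu - eta * g, 0.0, mu_max))
    return softmax(theta), mu


if __name__ == "__main__":
    d, mu = run_pdpg_toy()
    print("occupancy:", d, "dual variable:", mu)
```

For this toy problem the KKT conditions give the constrained maximizer $d^\star = (0.5, 0.25, 0.25)$ with multiplier $\mu^\star = \log 2$, so the iterates should settle near those values. The objective is nonconcave in `theta` but concave in `d`, which is the kind of hidden concavity the abstract refers to.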
