Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović
TL;DR
This work studies constrained Markov decision processes in the discounted infinite-horizon setting and develops a Natural Policy Gradient Primal-Dual (NPG-PD) algorithm that updates the primal policy via natural policy gradient ascent and the dual variable via projected subgradient descent. Theoretical results establish global, dimension-free convergence at rate $O(1/\sqrt{T})$ for the tabular softmax case under Slater's condition, and extend to log-linear and general smooth policy classes with sublinear rates up to a function-approximation error. Finite-sample guarantees are provided for two model-free variants, along with empirical validation on robotic tasks that demonstrates robust constraint satisfaction and competitive performance. The findings advance the understanding of global finite-time convergence in nonconvex constrained MDP settings and inform the design of sample-efficient constrained RL algorithms.
Abstract
We study the sequential decision-making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
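To make the primal-dual structure described above concrete, the following is a minimal sketch of a tabular softmax variant with exact policy evaluation. It is an illustration under stated assumptions, not the authors' implementation: the constraint is assumed to take the form $V_g^{\pi}(\rho) \ge b$, the primal step is assumed to be the usual softmax natural-gradient step that adds a scaled Lagrangian advantage to the logits, and the names `evaluate`, `npg_pd`, `eta1`, `eta2`, and `lam_max` (the dual projection bound) are hypothetical choices made for this example.

```python
import numpy as np


def softmax(theta):
    """Row-wise softmax over actions, shifted for numerical stability."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def evaluate(P, c, pi, gamma):
    """Exact discounted Q and V for a per-step signal c (S x A) under pi (S x A).

    P has shape (S, A, S): P[s, a, s'] is the transition probability.
    """
    S, _ = c.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)             # expected per-step signal under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + gamma * (P @ V)                 # (S, A, S) @ (S,) -> (S, A)
    return Q, V


def npg_pd(P, r, g, rho, b, gamma=0.9, eta1=1.0, eta2=0.1, lam_max=10.0, T=200):
    """Sketch of a primal-dual loop (assumed form, exact evaluation).

    Primal: logits move along the advantage of the Lagrangian r + lam * g.
    Dual:   lam takes a subgradient step on the constraint slack V_g(rho) - b
            and is projected (clipped) onto [0, lam_max].
    """
    theta = np.zeros_like(r, dtype=float)
    lam = 0.0
    history = []
    for _ in range(T):
        pi = softmax(theta)
        Q_r, V_r = evaluate(P, r, pi, gamma)
        Q_g, V_g = evaluate(P, g, pi, gamma)
        # Advantage of the Lagrangian under the current policy.
        adv = (Q_r - V_r[:, None]) + lam * (Q_g - V_g[:, None])
        theta = theta + eta1 / (1.0 - gamma) * adv
        # Projected subgradient descent on the dual variable.
        lam = float(np.clip(lam - eta2 * (rho @ V_g - b), 0.0, lam_max))
        history.append((float(rho @ V_r), float(rho @ V_g), lam))
    return softmax(theta), lam, history
```

In this sketch the dual variable grows while the utility value falls short of `b` and shrinks toward zero once the constraint holds with slack, which is the mechanism that trades reward against constraint satisfaction. The sample-based variants mentioned in the abstract would replace the exact `evaluate` call with estimates from trajectories.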
