Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović

TL;DR

This work studies constrained Markov decision processes (constrained MDPs) in the discounted infinite-horizon setting and develops a Natural Policy Gradient Primal-Dual (NPG-PD) algorithm that updates the primal policy via natural policy gradient ascent and the dual variable via projected subgradient descent. For the tabular softmax case under the Slater condition, the theoretical results establish global, dimension-free convergence at rate $O(1/\sqrt{T})$; for log-linear and general smooth policy classes, they give sublinear rates up to a function-approximation error. Finite-sample guarantees are provided for two model-free variants, and experiments on robotic tasks demonstrate robust constraint satisfaction and competitive performance. The findings advance the understanding of global finite-time convergence in nonconvex constrained MDP settings and inform the design of sample-efficient constrained RL algorithms.
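
As a reading aid, the setup summarized above can be written in symbols. This is a sketch in assumed notation rather than a quotation from the paper: $V_r^{\pi_\theta}$, $V_g^{\pi_\theta}$, and the threshold $b$ follow the Figure 1 caption, while the initial distribution $\rho$, the multiplier $\lambda$, the step sizes $\eta_1, \eta_2$, the projection interval $\Lambda$, and the Fisher-information form $F_\rho(\theta)^{\dagger}$ of the natural gradient are standard conventions assumed here.

```latex
\begin{align*}
&\text{Problem:}
  && \max_{\theta}\; V_r^{\pi_\theta}(\rho)
     \quad \text{subject to} \quad V_g^{\pi_\theta}(\rho) \ge b, \\
&\text{Lagrangian:}
  && V_L^{\pi_\theta,\lambda}(\rho)
     = V_r^{\pi_\theta}(\rho) + \lambda\,\bigl(V_g^{\pi_\theta}(\rho) - b\bigr),
     \qquad \lambda \ge 0, \\
&\text{Primal (natural gradient ascent):}
  && \theta^{(t+1)} = \theta^{(t)}
     + \eta_1\, F_\rho(\theta^{(t)})^{\dagger}\,
       \nabla_\theta V_L^{\pi_{\theta^{(t)}},\lambda^{(t)}}(\rho), \\
&\text{Dual (projected subgradient descent):}
  && \lambda^{(t+1)} = \mathcal{P}_{\Lambda}\Bigl(\lambda^{(t)}
     - \eta_2\,\bigl(V_g^{\pi_{\theta^{(t)}}}(\rho) - b\bigr)\Bigr).
\end{align*}
```

The dual step decreases $\lambda$ when the constraint holds with slack and increases it when the constraint is violated, which is why the constraint violation and the optimality gap are tracked together.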

Abstract

We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
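
The primal-dual iteration described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: it assumes exact policy evaluation on a small, randomly generated constrained MDP, uses the standard fact that the natural policy gradient step under a tabular softmax policy adds a scaled advantage to the logits, and introduces hypothetical names (`random_cmdp`, `evaluate`, `npg_pd`) and parameters (`eta1`, `eta2`, `lam_max`) chosen only for illustration.

```python
import numpy as np

def random_cmdp(S, A, seed=0):
    """Hypothetical helper: random transitions P[s, a, s'], reward r, utility g."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # each P[s, a, :] sums to 1
    r = rng.uniform(size=(S, A))
    g = rng.uniform(size=(S, A))
    rho = np.full(S, 1.0 / S)                    # uniform initial distribution
    return P, r, g, rho

def evaluate(P, c, pi, gamma):
    """Exact discounted Q and V for a per-step signal c[s, a] under policy pi."""
    S = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state transitions under pi
    c_pi = (pi * c).sum(axis=1)                  # expected one-step signal
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    Q = c + gamma * (P @ V)
    return Q, V

def softmax(theta):
    z = theta - theta.max(axis=1, keepdims=True) # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def npg_pd(P, r, g, b, rho, gamma=0.9, eta1=1.0, eta2=0.1, lam_max=10.0, T=200):
    """Sketch of a natural policy gradient primal-dual loop (tabular softmax)."""
    S, A = r.shape
    theta = np.zeros((S, A))                     # softmax logits
    lam = 0.0                                    # dual variable
    history = []
    for _ in range(T):
        pi = softmax(theta)
        Qr, Vr = evaluate(P, r, pi, gamma)
        Qg, Vg = evaluate(P, g, pi, gamma)
        history.append((float(rho @ Vr), float(rho @ Vg), lam))
        # Advantage of the Lagrangian signal r + lam * g at the current policy.
        adv = (Qr - Vr[:, None]) + lam * (Qg - Vg[:, None])
        # Primal: natural gradient ascent, which for tabular softmax is a
        # multiplicative-weights style update on the policy.
        theta = theta + eta1 / (1.0 - gamma) * adv
        # Dual: projected subgradient descent onto [0, lam_max] (assumed interval).
        lam = float(np.clip(lam - eta2 * (rho @ Vg - b), 0.0, lam_max))
    return softmax(theta), lam, history

if __name__ == "__main__":
    P, r, g, rho = random_cmdp(S=4, A=3, seed=0)
    pi, lam, history = npg_pd(P, r, g, b=4.0, rho=rho)
    reward_value, utility_value, _ = history[-1]
    print(f"reward value {reward_value:.3f}, utility value {utility_value:.3f}, lambda {lam:.3f}")
```

Both updates use quantities evaluated at the same iterate, so the dual step sees the utility value of the policy that generated the advantage. The sample-based variants mentioned in the abstract would replace `evaluate` with estimates from rollouts; that part is not sketched here.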

Paper Structure (34 sections, 19 theorems, 213 equations, 4 figures, 2 tables, 5 algorithms)

Key Result

Lemma 3

Let the Slater condition assumption hold. Then

Figures (4)

  • Figure 1: An example of a constrained MDP for which the objective function $V_r^{\pi_\theta}(s)$ in the constrained MDP problem is not concave and the constraint set $\{\theta \in \Theta \,\vert\, V_g^{\pi_\theta}(s) \geq b\}$ is not convex. A pair $(r,g)$ associated with a directed arrow represents the $(\text{reward, utility})$ received when an action in a certain state is taken. This example is utilized in the proof of Lemma 6 (Lack of convexity).
  • Figure 2: Learning curves of the NPG-PD method (blue), CUP [yang2022constrained] (red), FOCOPS [zhang2020first] (orange), TRPOLag [ray2019benchmarking] (black), and PPOLag [ray2019benchmarking] (green) for the Ant-v1 and Humanoid-v1 robotic tasks with a speed limit of $25$. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of $1000$ bootstrap samples obtained over $3$ random seeds, and the shaded regions display the bootstrap $95\%$ confidence intervals.
  • Figure 3: Learning curves of the NPG-PD method (blue), CUP [yang2022constrained] (red), FOCOPS [zhang2020first] (orange), TRPOLag [ray2019benchmarking] (black), and PPOLag [ray2019benchmarking] (green) for the HalfCheetah-v1 and Walker2d-v1 robotic tasks with a speed limit of $25$. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of $1000$ bootstrap samples obtained over $3$ random seeds, and the shaded regions display the bootstrap $95\%$ confidence intervals.
  • Figure 4: Learning curves of the NPG-PD method (blue), CUP [yang2022constrained] (red), FOCOPS [zhang2020first] (orange), TRPOLag [ray2019benchmarking] (black), and PPOLag [ray2019benchmarking] (green) for the Hopper-v1 and Swimmer-v1 robotic tasks with a speed limit of $25$. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of $1000$ bootstrap samples obtained over $3$ random seeds, and the shaded regions display the bootstrap $95\%$ confidence intervals.

Theorems & Definitions (25)

  • Remark 1
  • Lemma 3: Strong duality and boundedness of $\lambda^\star$
  • Remark 4
  • Lemma 5: Constraint violation
  • Lemma 6: Lack of convexity
  • Theorem 7: Restrictive convergence: direct policy parametrization
  • Remark 8
  • Lemma 9: Primal update as MWU
  • Theorem 10: Global convergence: softmax policy parametrization
  • Lemma 11: Non-monotonic improvement
  • ...and 15 more