Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen

TL;DR

This work introduces VR-PDPG, a variance-reduced primal-dual policy gradient method for Concave CMDPs, in which both the objective and the constraints are concave in the occupancy measure. By exploiting hidden concavity, obtained through local invertibility of the map between policy parameters and occupancy measures, together with strong duality under Slater's condition, the authors derive a primal-dual policy gradient (PDPG) framework and a variance-reduced variant for the sample-based setting. They establish global convergence in both the general and the strongly concave case. In the exact setting the rates are $O(T^{-1/3})$ (general) and $O(T^{-1/2})$ (strongly concave); in the sample-based setting the sample complexity is $\widetilde{O}(\epsilon^{-4})$ for $\epsilon$-global optimality. They also show that adding a diminishing pessimistic term to the constraint yields zero constraint violation without degrading the convergence rate of the optimality gap. Numerical experiments on gridworlds corroborate improved performance and sample efficiency relative to baselines. The work extends policy-based primal-dual methods for safe RL to general concave utilities and constraints.

Abstract

We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we demonstrate that VR-PDPG achieves an $\widetilde{O}(\epsilon^{-4})$ sample complexity for $\epsilon$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
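The primal-dual update described in the abstract (gradient ascent on the policy parameter, projected descent on the dual variable) can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it assumes a single-state problem with two actions, so the occupancy measure is just $(p, 1-p)$ with $p = \mathrm{sigmoid}(\theta)$, and it uses exact gradients. The objective `f`, the constraint `g` (feasible when `g(p) >= 0`), the step size `eta`, and the dual bound `c0` are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def f(p):
    # Hypothetical objective, concave in the occupancy (p, 1 - p).
    return 2.0 * p - p * p


def f_prime(p):
    return 2.0 - 2.0 * p


def g(p):
    # Hypothetical concave (here linear) constraint; feasible when g(p) >= 0.
    return 0.5 - p


def primal_dual(steps=5000, eta=0.5, c0=3.0):
    """Exact-gradient primal-dual iteration on L(theta, mu) = f(p) + mu * g(p)."""
    theta, mu = 0.0, 0.0
    for _ in range(steps):
        p = sigmoid(theta)
        # Chain rule: dL/dtheta = (f'(p) + mu * g'(p)) * dp/dtheta, with g'(p) = -1.
        grad_theta = (f_prime(p) - mu) * p * (1.0 - p)
        theta_next = theta + eta * grad_theta          # primal ascent step
        mu = min(max(mu - eta * g(p), 0.0), c0)        # projected dual descent on [0, c0]
        theta = theta_next
    return sigmoid(theta), mu


if __name__ == "__main__":
    p, mu = primal_dual()
    print(p, mu, f(p), g(p))
```

In this toy instance the constrained optimum is $p = 0.5$ with multiplier $\mu = 1$, since $f'(0.5) = 1$. The problem is nonconcave in $\theta$ but concave in $p$, which is a small instance of the hidden concavity the abstract refers to.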

Paper Structure

This paper contains 41 sections, 23 theorems, 200 equations, 2 figures, 1 table, and 1 algorithm.

Key Result

Lemma 2.1

Let the Slater assumption hold, and suppose that $\operatorname{cl}(\{\lambda(\theta) \mid \theta \in \mathbb{R}^K\}) = \Lambda$, where $\operatorname{cl}(\cdot)$ denotes the closure operation. Then strong duality holds and the optimal dual variable $\mu^\star$ is bounded. Hence, it suffices to search for the optimal dual variable within the closed interval $U := [0, C_0]$, where $C_0 := 1 + (M - f(\lambda(\widetilde{\theta})))/\xi$.
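As a small worked illustration of the interval $U = [0, C_0]$, the sketch below evaluates the formula for $C_0$ and projects a dual iterate onto $U$. The function names are hypothetical. The code also assumes that $M$ is an upper bound on the objective, that `f_slater` stands for $f(\lambda(\widetilde{\theta}))$ at a strictly feasible point, and that $\xi > 0$ is the associated slack; these readings are inferred from the formula rather than stated in the excerpt.

```python
def dual_upper_bound(M, f_slater, xi):
    """C_0 = 1 + (M - f_slater) / xi, following the formula in Lemma 2.1.

    Assumed meanings (not confirmed by the excerpt): M bounds the objective
    from above, f_slater is the objective at a strictly feasible point, and
    xi > 0 is the slack of that point.
    """
    if xi <= 0:
        raise ValueError("xi must be strictly positive")
    return 1.0 + (M - f_slater) / xi


def project_dual(mu, c0):
    """Euclidean projection of a scalar dual iterate onto U = [0, c0]."""
    return min(max(mu, 0.0), c0)


if __name__ == "__main__":
    c0 = dual_upper_bound(M=1.0, f_slater=0.0, xi=0.5)
    print(c0, project_dual(-0.2, c0), project_dual(5.0, c0))
```

Restricting the dual variable to a bounded interval is what makes the projected sub-gradient step well defined without knowing $\mu^\star$ in advance.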

Figures (2)

  • Figure 1: Two instances of $8 \times 8$ gridworld experiments under different reference trajectories. For $k \in \{1, 2\}$, the labels trajk, trajk($\alpha_t=1$), and trajk(Avg) correspond to VR-PDPG, VR-PDPG with $\alpha_t=1$, and VR-PDPG with $\alpha_t=1$ and multi-trajectory estimation, respectively.
  • Figure 2: Two instances of $20 \times 20$ gridworld experiments under different reference trajectories.
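The Figure 1 caption contrasts VR-PDPG with a variant that fixes $\alpha_t = 1$. The excerpt does not define $\alpha_t$, so the following is only a guess at its role: in generic recursive (momentum-style) variance reduction, a mixing weight of 1 turns the correction off and the estimator reduces to the plain stochastic gradient. The sketch shows that generic technique, not the paper's estimator, and the function name and argument names are hypothetical.

```python
def recursive_estimate(d_prev, grad_new, grad_old, alpha):
    """Generic recursive variance-reduced gradient estimate.

    d_t = grad_new + (1 - alpha) * (d_prev - grad_old), elementwise, where
    grad_new and grad_old are gradients at the current and previous iterate
    computed from the same sample. With alpha = 1 the correction vanishes.
    """
    return [
        gn + (1.0 - alpha) * (dp - go)
        for dp, gn, go in zip(d_prev, grad_new, grad_old)
    ]


if __name__ == "__main__":
    d_prev = [5.0, -1.0]
    grad_new = [0.3, 0.7]
    grad_old = [0.1, 0.2]
    print(recursive_estimate(d_prev, grad_new, grad_old, alpha=1.0))
    print(recursive_estimate(d_prev, grad_new, grad_old, alpha=0.5))
```

Under this reading, intermediate values of `alpha` trade the bias carried over from `d_prev` against the variance of the fresh sample.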

Theorems & Definitions (45)

  • Example 1: Safe learning
  • Example 2: Safe exploration
  • Example 3: Safety-aware apprenticeship learning (AL)
  • Example 4: Feasibility constrained MDPs
  • Lemma 2.1: Strong duality and boundedness of $\mu^\star$
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Theorem 4.4: Exact setting
  • Lemma 4.5
  • ...and 35 more