Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

Sergio Rozada, Dongsheng Ding, Antonio G. Marques, Alejandro Ribeiro

TL;DR

This work addresses the problem of computing deterministic optimal policies for constrained MDPs with continuous state-action spaces. It establishes zero duality gap under a non-atomicity assumption and proposes a deterministic policy gradient primal-dual (D-PGPD) method built on a quadratic-regularized Lagrangian. The algorithm updates the deterministic policy via a proximal ascent step and the dual variable via a gradient descent step, and the theoretical results show non-asymptotic, linear convergence to a regularized saddle point. With function approximation, the method becomes AD-PGPD, which is provably near-optimal up to an approximation error, and a model-free, sample-based version of AD-PGPD handles unknown dynamics. Empirical results on robot navigation and Burgers' equation show improved constraint satisfaction and reduced oscillations compared to stochastic-policy baselines. Overall, the paper advances deterministic policy search in continuous-space constrained MDPs, with implications for safe, scalable control in robotics and fluid dynamics, and it leaves open avenues for online learning and tighter sample-efficiency guarantees.

Abstract

We study the problem of computing deterministic optimal policies for constrained Markov decision processes (MDPs) with continuous state and action spaces, which are widely encountered in constrained dynamical systems. Designing deterministic policy gradient methods in continuous state and action spaces is particularly challenging due to the lack of enumerable state-action pairs and the adoption of deterministic policies, hindering the application of existing policy gradient methods. To this end, we develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence. Specifically, we leverage regularization of the Lagrangian of the constrained MDP to propose a deterministic policy gradient primal-dual (D-PGPD) algorithm that updates the deterministic policy via a quadratic-regularized gradient ascent step and the dual variable via a quadratic-regularized gradient descent step. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. We instantiate D-PGPD with function approximation and prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair, up to a function approximation error. Furthermore, we demonstrate the effectiveness of our method in two continuous control problems: robot navigation and fluid control. This appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
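
The update pattern described above (a quadratic-regularized proximal ascent step for the policy, a quadratic-regularized descent step for the dual variable) can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's algorithm: the reward `r`, the utility `g`, the form of the regularized Lagrangian, and the names `tau`, `eta`, and `lam_max` are all assumptions chosen so that the proximal step has a closed form.

```python
def toy_primal_dual(tau=0.01, eta=0.1, lam_max=10.0, iters=2000):
    """Regularized primal-dual iteration on a scalar toy problem.

    Toy problem (an illustrative assumption, not taken from the paper):
        maximize r(a) = -(a - 2)^2  subject to  g(a) = 1 - a >= 0.
    Assumed regularized Lagrangian, maximized in a and minimized in lam:
        L(a, lam) = r(a) + lam * g(a) - (tau / 2) * a^2 + (tau / 2) * lam^2
    """
    a, lam = 0.0, 0.0
    for _ in range(iters):
        # Primal proximal ascent step: closed-form maximizer over a' of
        # r(a') + lam * g(a') - (tau / 2) * a'^2 - (a' - a)^2 / (2 * eta).
        a_next = (4.0 - lam + a / eta) / (2.0 + tau + 1.0 / eta)

        # Dual step: gradient descent on L in lam at the current action,
        # where dL/dlam = g(a) + tau * lam, then projection onto [0, lam_max].
        lam_next = lam - eta * ((1.0 - a) + tau * lam)
        lam_next = min(max(lam_next, 0.0), lam_max)

        a, lam = a_next, lam_next
    return a, lam


if __name__ == "__main__":
    a, lam = toy_primal_dual()
    print(f"action = {a:.4f}, multiplier = {lam:.4f}")
```

For this toy problem the unregularized constrained optimum is `a = 1`, while the regularized saddle point solves `a * (2 + tau) = 4 - lam` and `lam = (a - 1) / tau`, which for `tau = 0.01` gives `a = 104 / 102.01`, roughly 1.0195. The small constraint violation is the bias introduced by regularization and shrinks as `tau` decreases.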

Paper Structure (28 sections, 19 theorems, 160 equations, 11 figures, 3 algorithms)


Key Result

Lemma 1

For a non-atomic discounted MDP, the deterministic value image ${\mathcal{V}}_D$ is convex, and equals the value image ${\mathcal{V}}_T$, i.e., ${\mathcal{V}}_D = {\mathcal{V}}_T$.
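
To make the connection to zero duality gap concrete, the primal and dual problems can be written out. The display below is a sketch in assumed notation: $V_r(\pi)$ and $V_g(\pi)$ denote the reward and utility values of a policy $\pi$, $\Pi_D$ is the set of deterministic policies, and the constraint threshold is taken to be zero. These conventions are not stated in this summary and should be checked against the paper's definitions.

$$
V_P^\star = \max_{\pi \in \Pi_D} V_r(\pi) \quad \text{s.t.} \quad V_g(\pi) \ge 0,
\qquad
V_D^\star = \min_{\lambda \ge 0} \, \max_{\pi \in \Pi_D} \; V_r(\pi) + \lambda V_g(\pi).
$$

Weak duality gives $V_P^\star \le V_D^\star$ for any policy class. Convexity of the set of attainable value pairs, here ${\mathcal{V}}_D$, is the property that, together with a strict-feasibility condition, typically allows a separating-hyperplane argument to show $V_P^\star = V_D^\star$.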

Figures (11)

  • Figure 1: Navigation trajectories of an agent (Left) and velocity profile of the fluid over time (Right).
  • Figure 2: Average reward and utility value functions of AD-PGPD and PGDual iterates in the navigation problem.
  • Figure 3: Average reward and utility value functions of AD-PGPD and PGDual iterates in the fluid velocity control problem.
  • Figure 4: The deterministic value image ${\mathcal{V}}_D$ is convex and equal to the value image over all policies, ${\mathcal{V}}_T$. Furthermore, constrained RL has zero duality gap in the deterministic policy space, i.e., $V_P^\star=V_D^\star$.
  • Figure 5: Reward and utility value functions of policy iterates generated by D-PGPD and AD-PGPD in the navigation control problem with quadratic rewards.
  • ...and 6 more figures

Theorems & Definitions (37)

  • Lemma 1: Sufficiency of deterministic policies
  • Theorem 1: Zero duality gap
  • Theorem 2: Linear convergence
  • Corollary 1: Near-optimality
  • Theorem 3: Linear convergence
  • Corollary 2: Near-optimality of approximation
  • Corollary 3: Linear convergence
  • Lemma 2: Discounted and uniformly absorbing MDP equivalence
  • proof
  • Lemma 3: Convexity of the deterministic value image
  • ...and 27 more