Adaptive Primal-Dual Method for Safe Reinforcement Learning

Weiqin Chen; James Onyejizu; Long Vu; Lan Hoang; Dharmashankar Subramanian; Koushik Kar; Sandipan Mishra; Santiago Paternain

TL;DR

This work tackles Safe Reinforcement Learning by formulating CMDPs and addressing the interdependence between the primal learning rate and dual variables. It introduces Adaptive Primal-Dual (APD) methods with two LR rules that depend inversely on the Lagrangian multipliers, providing convergence, return optimality, and feasibility guarantees. A practical variant, PAPD, uses InvLin and InvQua learning rates along with PID-Lagrangian dual updates, and is empirically evaluated against constant-LR baselines on four Bullet-Safety-Gym environments with PPO-Lagrangian and DDPG-Lagrangian, showing improved stability and often superior performance. The results demonstrate robustness to hyper-parameter choices and suggest broad practical impact for safe RL in constrained settings, with theoretical underpinnings complemented by extensive experiments and supplementary proofs.

Abstract

Primal-dual methods have a natural application in Safe Reinforcement Learning (SRL), posed as a constrained policy optimization problem. In practice, however, applying primal-dual methods to SRL is challenging because of the interdependence between the learning rate (LR) and the Lagrangian multipliers (dual variables) each time an embedded unconstrained RL problem is solved. In this paper, we propose, analyze, and evaluate adaptive primal-dual (APD) methods for SRL, in which two adaptive LRs are adjusted to the Lagrangian multipliers so as to optimize the policy in each iteration. We theoretically establish the convergence, optimality, and feasibility of the APD algorithm. Finally, we numerically evaluate the practical APD algorithm on four well-known environments in Bullet-Safety-Gym using two state-of-the-art SRL algorithms: PPO-Lagrangian and DDPG-Lagrangian. In all experiments, the practical APD algorithm outperforms or matches the constant-LR baselines and trains more stably. Additionally, we substantiate empirically that the choice of the two adaptive LRs is robust.
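To make the idea of tying the primal LR to the Lagrangian multiplier concrete, the following is a minimal sketch on a toy scalar problem, not the paper's algorithm. Everything in it is an assumption made for illustration: the functional forms of `inv_lin_lr` and `inv_qua_lr` (the abstract only states that the LRs are adjusted to the multipliers), the toy objective, and all constants. The toy problem maximizes $-(\theta-2)^2$ subject to $\theta \le 1$, whose KKT solution is $\theta^* = 1$, $\lambda^* = 2$.

```python
def inv_lin_lr(lam, h1, h2):
    """Hypothetical inverse-linear rule: LR decays like 1 / (lam + h2)."""
    return h1 / (lam + h2)


def inv_qua_lr(lam, h1, h2):
    """Hypothetical inverse-quadratic rule: LR decays like 1 / (lam + h2)**2."""
    return h1 / (lam + h2) ** 2


def toy_primal_dual(lr_rule, h1, h2, dual_lr=0.05, cost_limit=1.0, iters=5000):
    """Primal-dual ascent/descent on a toy constrained problem.

    Lagrangian (assumed toy form):
        L(theta, lam) = -(theta - 2)^2 - lam * (theta - cost_limit)
    The primal step size is recomputed from the current multiplier.
    """
    theta, lam = 0.0, 0.0
    for _ in range(iters):
        # Gradient of the Lagrangian with respect to theta.
        grad_theta = -2.0 * (theta - 2.0) - lam
        # Primal ascent step with a multiplier-dependent learning rate.
        theta += lr_rule(lam, h1, h2) * grad_theta
        # Projected dual step: raise lam when the constraint is violated.
        lam = max(0.0, lam + dual_lr * (theta - cost_limit))
    return theta, lam


if __name__ == "__main__":
    for rule in (inv_lin_lr, inv_qua_lr):
        theta, lam = toy_primal_dual(rule, h1=0.3, h2=1.0)
        print(rule.__name__, round(theta, 3), round(lam, 3))
```

In this sketch the primal step shrinks as the multiplier grows, so the update stays well scaled when the penalty term dominates the gradient. Both rules should settle near the constrained optimum; the constants `h1=0.3` and `h2=1.0` are arbitrary and unrelated to the values reported in the figure captions.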

Paper Structure (24 sections, 7 theorems, 76 equations, 4 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Primal Methods.
Primal-Dual Methods.
Main Contribution
Safe Reinforcement Learning
Adaptive Primal-Dual Algorithm
Motivation
Adaptive Learning Rate
Experiments
Environment
Results
Robustness Verification
Concluding Remarks
Supplementary Material
...and 9 more sections

Key Result

Theorem 1

Consider the dual function $d(\cdot)$ defined in eqn_dual_function, the constraint function $g(\cdot)$ and cost limit ${\bf d}$ in eqn_constraint_func. Let $\lambda^* \in \mathop{\mathrm{arg\,max}}\limits_{\lambda \in \mathbb{R}_+^m} d(\lambda)$ and define $D^*= d(\lambda^*)$. Let $\theta_k$ and …

Figures (4)

Figure 1: Learning curves of PPOL over five independent runs with the Lagrangian multiplier (LM) fixed at 1 and at 5. The horizontal axis represents time steps. Cost limit $\textbf{d}=10$ (black dashed line) in all experiments. LR = 0.0006 outperforms LR = 0.0003 at LM = 1 (the red curve is infeasible), while the opposite holds at LM = 5.
Figure 2: Learning curves for PPOL over four environments with five independent runs. In all figures, the horizontal axis is the number of time steps. The solid line shows the mean and the shaded area spans the minimum and maximum. In all experiments, $H_1 = 0.001, H_2 = 3$ for InvLin, $H_1' = 0.015, H_2' = 6$ for InvQua, and cost limit $\textbf{d}=10$ (black dashed line).
Figure 3: Agents and Tasks from gronauer2022bullet: (a) The Ball agent; (b) The Car agent; (c) The Run task; (d) The Circle task.
Figure 4: Learning curves for DDPGL over four environments with five independent runs. In all figures, the horizontal axis is the number of time steps. The solid line shows the mean and the shaded area spans the minimum and maximum. In all experiments, $H_1 = 0.003, H_2 = 4.5$ for InvLin, $H_1' = 0.045, H_2' = 7.5$ for InvQua, and cost limit $\textbf{d}=10$ (black dashed line).

Theorems & Definitions (7)

Theorem 1
Lemma 1
Theorem 2
Theorem 3
Theorem 4
Lemma 2
Lemma 3
