Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

Jianmina Ma, Jingtian Ji, Yue Gao

TL;DR

We propose Adversarial Constrained Policy Optimization (ACPO), which optimizes reward and adapts cost budgets simultaneously during training, and which achieves better performance than commonly used baselines.

Abstract

Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods struggle to strike the right balance between task performance and constraint satisfaction, and they are prone to getting stuck in over-conservative or constraint-violating local minima. In this paper, we propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of reward and adaptation of cost budgets during training. Our approach divides the original constrained problem into two adversarial stages that are solved alternately, and the policy update performance of our algorithm is theoretically guaranteed. We validate our method through experiments on Safety Gymnasium and quadruped locomotion tasks. Results demonstrate that our algorithm achieves better performance than commonly used baselines.

Paper Structure

This paper contains 31 sections, 7 theorems, 49 equations, 8 figures, 8 tables, 3 algorithms.

Key Result

Lemma 3.1

For any policies $\pi$ and $\pi'$, the following bounds hold: where $\epsilon^{\pi'}=\max_s|\mathbb{E}_{a\sim\pi'}[A^{\pi}(s,a)]|$.
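
Assuming this lemma restates the policy performance bound from the CPO paper (Achiam et al., 2017), as its label and the definition of $\epsilon^{\pi'}$ suggest, the bounds take the form

$$\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi'}\left[A^{\pi}(s,a) - \frac{2\gamma\epsilon^{\pi'}}{1-\gamma}D_{TV}(\pi'\|\pi)[s]\right] \le J(\pi') - J(\pi) \le \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi'}\left[A^{\pi}(s,a) + \frac{2\gamma\epsilon^{\pi'}}{1-\gamma}D_{TV}(\pi'\|\pi)[s]\right],$$

where $d^{\pi}$ is the normalized discounted state distribution of $\pi$. The sketch below checks that assumed form numerically on a small random tabular MDP. The MDP itself, the reward shape $R(s,a)$, and the helper names `random_policy`, `evaluate`, and `cpo_bounds` are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def random_policy(rng, n_states, n_actions):
    """Sample a random stochastic policy as an (S, A) row-stochastic matrix."""
    p = rng.random((n_states, n_actions))
    return p / p.sum(axis=1, keepdims=True)


def evaluate(P, R, mu, pi, gamma):
    """Exact policy evaluation for a tabular MDP.

    P: (S, A, S) transitions, R: (S, A) rewards, mu: (S,) start distribution.
    Returns J(pi), the advantage A^pi of shape (S, A), and d^pi of shape (S,).
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)                 # (S, A)
    adv = Q - V[:, None]
    # Normalized discounted state visitation: (1 - gamma) * sum_t gamma^t Pr(s_t = s)
    d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
    return mu @ V, adv, d


def cpo_bounds(adv, d, pi_new, pi_old, gamma):
    """Lower and upper bounds on J(pi_new) - J(pi_old) in the assumed CPO form."""
    exp_adv = (pi_new * adv).sum(axis=1)              # E_{a~pi'}[A^pi(s, a)] per state
    eps = np.abs(exp_adv).max()                       # epsilon^{pi'}
    tv = 0.5 * np.abs(pi_new - pi_old).sum(axis=1)    # D_TV(pi' || pi)[s]
    surrogate = d @ exp_adv
    penalty = 2 * gamma * eps / (1 - gamma) * (d @ tv)
    return (surrogate - penalty) / (1 - gamma), (surrogate + penalty) / (1 - gamma)


rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)

pi_old = random_policy(rng, n_states, n_actions)
# A nearby policy: a convex mixture of pi_old and another random policy.
pi_new = 0.8 * pi_old + 0.2 * random_policy(rng, n_states, n_actions)

J_old, adv_old, d_old = evaluate(P, R, mu, pi_old, gamma)
J_new, _, _ = evaluate(P, R, mu, pi_new, gamma)
lower, upper = cpo_bounds(adv_old, d_old, pi_new, pi_old, gamma)
gap = J_new - J_old

print(f"lower={lower:.4f}  J(pi')-J(pi)={gap:.4f}  upper={upper:.4f}")
```

The same check applies to a cost signal by replacing `R` with a cost table, in which case the upper bound is the one that matters for constraint satisfaction.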

Figures (8)

  • Figure 1: An illustration of our algorithm. The red dotted line represents the alternating iterations between the max-reward and min-cost stages. The blue dotted line represents the projection stage. $\Delta d$ denotes the change in the cost budget after the projection stage, explained in the section on the projection stage.
  • Figure 2: Learning curves of different algorithms in Safety Gymnasium environments. The dashed line represents the desired cost budget.
  • Figure 3: Episode reward and cost return of different algorithms in the quadruped locomotion task. Results are averaged over 5 training runs with random seeds. The dashed line represents the desired cost budget of 2. The episode reward in the figure is normalized relative to the maximum episode length of $1000$.
  • Figure 4: An overview of single constraint environments.
  • Figure 5: An illustration of the quadruped locomotion environment.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Lemma 3.1: Policy performance bound (CPO)
  • Definition 4.1: Pareto-optimal solution
  • Theorem 4.2
  • proof
  • Theorem 5.1
  • proof
  • Proposition 5.2
  • Lemma 1.1: Trust region update performance bound (CPO)
  • Lemma 1.2
  • proof
  • ...and 2 more