Projection-Based Constrained Policy Optimization

Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge

TL;DR

The paper tackles learning control policies that maximize reward under safety and fairness constraints by introducing PCPO, a two-stage method that first improves reward within a trust region and then projects the policy onto the constraint set. It provides theoretical bounds for reward improvement and constraint violation under KL and $L^2$ projections and analyzes convergence via the Fisher information matrix. Empirically, PCPO delivers substantially fewer constraint violations and higher rewards than state-of-the-art methods across four control tasks, demonstrating robustness to infeasibilities and approximation errors. The work advances safe and reliable RL deployment by guaranteeing constraint satisfaction during learning and offering practical, hyperparameter-free updates.

Abstract

We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $L^2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
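
The two-step update described above can be illustrated on a toy problem. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the reward and the constraint are linearized around the current parameters and that the KL divergence is approximated by a quadratic form. The names `pcpo_step`, `g`, `a`, `b`, and `H` are hypothetical and do not come from the text.

```python
import numpy as np


def pcpo_step(theta, g, a, b, H, delta, projection="kl"):
    """One two-step update on a linearized toy problem (illustrative sketch).

    Assumed inputs (hypothetical names, not from the source text):
      theta: current parameter vector
      g:     gradient of the reward objective at theta (must be nonzero)
      a:     gradient of the constraint at theta (must be nonzero)
      b:     current constraint value minus the threshold h
      H:     positive-definite matrix standing in for the Fisher information
      delta: trust-region size used in the reward improvement step
    """
    # Step 1: maximize g^T d subject to 0.5 * d^T H d <= delta.
    Hinv_g = np.linalg.solve(H, g)
    theta_half = theta + np.sqrt(2.0 * delta / (g @ Hinv_g)) * Hinv_g

    # Step 2: project onto the half-space {x : a^T (x - theta) + b <= 0}
    # under the metric L (L = H for a KL-style projection, L = I for L2).
    L = H if projection == "kl" else np.eye(len(theta))
    Linv_a = np.linalg.solve(L, a)
    violation = a @ (theta_half - theta) + b
    theta_new = theta_half - max(0.0, violation / (a @ Linv_a)) * Linv_a
    return theta_half, theta_new


# Toy usage: the reward gradient pushes straight into the constraint.
theta0 = np.zeros(2)
half, new = pcpo_step(
    theta0,
    g=np.array([1.0, 0.0]),
    a=np.array([1.0, 0.0]),
    b=-0.1,
    H=np.eye(2),
    delta=0.5,
)
print("after reward step:", half)
print("after projection: ", new)
```

In this sketch the projection is a no-op whenever the intermediate point already satisfies the linearized constraint, so the reward step is left untouched in that case. Choosing `projection="l2"` replaces the Fisher-like metric with the identity.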

Paper Structure

This paper contains 21 sections, 10 theorems, 53 equations, 14 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Define $\epsilon^{\pi^{k+1}}_{R}\doteq \max\limits_{s}|\mathrm{E}_{a\sim\pi^{k+1}}[A_{R}^{\pi^{k}}(s,a)]|$, and $\epsilon^{\pi^{k+1}}_{C}\doteq \max\limits_{s}|\mathrm{E}_{a\sim\pi^{k+1}}[A_{C}^{\pi^{k}}(s,a)]|$. If the current policy $\pi^k$ satisfies the constraint, then under KL divergence projection, the lower bound on reward improvement and the upper bound on constraint violation for each policy update are ... where $\delta$ is the step size in the reward improvement step.
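
The two $\epsilon$ quantities defined in Theorem 3.1 share the same form: the largest absolute expected advantage over states, where actions are drawn from the new policy and advantages are measured under the old one. A minimal sketch for a small tabular case follows. The function name and the array layout are assumptions made for illustration, and the numbers are invented toy values.

```python
import numpy as np


def worst_case_expected_advantage(pi_next, advantage):
    """Return max over states of |E_{a ~ pi_next}[A(s, a)]|.

    Assumed layout (hypothetical, for illustration only):
      pi_next[s, a]:   probability of action a in state s under pi^{k+1}
      advantage[s, a]: advantage A^{pi^k}(s, a), for reward or for cost
    """
    expected = np.sum(pi_next * advantage, axis=1)  # one value per state
    return np.max(np.abs(expected))


# Toy example with two states and two actions.
pi_next = np.array([[0.5, 0.5],
                    [1.0, 0.0]])
adv_reward = np.array([[1.0, -1.0],
                       [0.25, -2.0]])
eps_R = worst_case_expected_advantage(pi_next, adv_reward)
print(eps_R)
```

Passing the cost advantage instead of the reward advantage gives $\epsilon^{\pi^{k+1}}_{C}$ under the same assumptions.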

Figures (14)

  • Figure 1: Update procedures for PCPO. In step one (red arrow), PCPO follows the reward improvement direction in the trust region (light green). In step two (blue arrow), PCPO projects the policy onto the constraint set (light orange).
  • Figure 2: Update procedures for CPO (Achiam et al., 2017). CPO computes the update by simultaneously considering the trust region (light green) and the constraint set (light orange). CPO becomes infeasible when these two sets do not intersect.
  • Figure 3: The gather, circle, grid and bottleneck tasks. (a) Gather task: the agent is rewarded for gathering green apples but is constrained to collect a limited number of red fruit (Achiam et al., 2017). (b) Circle task: the agent is rewarded for moving in a specified wide circle, but is constrained to stay within a safe region smaller than the radius of the circle (Achiam et al., 2017). (c) Grid task: the agent controls the traffic lights in a grid road network and is rewarded for high throughput but constrained to let lights stay red for at most 7 consecutive seconds (Vinitsky et al., 2018). (d) Bottleneck task: the agent controls a set of autonomous vehicles (shown in red) in a traffic merge situation and is rewarded for achieving high throughput but constrained to ensure that human-driven vehicles (shown in white) have low speed for no more than 10 seconds (Vinitsky et al., 2018).
  • Figure 4: The values of the discounted reward and the undiscounted constraint value (the total number of constraint violations) along policy updates for the tested algorithm and task pairs. The solid line is the mean and the shaded area is the standard deviation over five runs. The dashed line in the cost constraint plot is the cost constraint threshold $h$. The curves for the baseline oracle, TRPO, indicate the reward and constraint violation values when the constraint is ignored. (Best viewed in color; the legend is shared across all the figures.)
  • Figure 5: The value of the discounted reward versus the cumulative constraint value for the tested algorithms and task pairs. See the supplemental material for learning curves in the other tasks. PCPO achieves less constraint violation under the same reward improvement compared to the other algorithms.
  • ...and 9 more figures

Theorems & Definitions (19)

  • Theorem 3.1: Worst-case Bound on Updating Constraint-satisfying Policies
  • proof
  • Theorem 3.2: Worst-case Bound on Updating Constraint-violating Policies
  • proof
  • Theorem 4.1: Reward Improvement Under $L^2$ Norm and KL Divergence Projections
  • proof
  • Lemma S.1
  • proof
  • Theorem S.2
  • proof
  • ...and 9 more