Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Linrui Zhang; Li Shen; Long Yang; Shixiang Chen; Bo Yuan; Xueqian Wang; Dacheng Tao

TL;DR

P3O reframes constrained policy optimization as an unconstrained problem using an exact penalty function, then applies the clipped surrogate objective from PPO in place of an explicit trust-region constraint, enabling efficient first-order optimization without Hessian inversions. The authors prove that the penalty is exact for a finite penalty factor $\kappa$ and bound the approximation error, and they extend the approach to multi-constraint and multi-agent settings. Empirical results on Circle, Gather, Navigation, and Simple Spread show that P3O achieves higher rewards with tighter constraint satisfaction than state-of-the-art safe-RL methods, and that the multi-agent variant MAP3O supports collaboration under safety requirements. The work offers a scalable, practical framework for safe reinforcement learning in complex CMDPs and DEC-POMDPs.
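To make the notion of an exact penalty concrete, the sketch below uses a toy one-dimensional problem that is not taken from the paper: minimize $x^2$ subject to $x \ge 1$. The constrained optimum is $x = 1$ with Lagrange multiplier 2, so a ReLU-style penalty $\kappa \max(0, 1 - x)$ should recover it exactly once $\kappa$ is at least 2, while a smaller $\kappa$ should leave the constraint violated. The function name `penalized_minimizer` and the grid search are illustrative assumptions.

```python
import numpy as np

def penalized_minimizer(kappa, grid):
    """Grid-search minimizer of x**2 + kappa * max(0, 1 - x).

    Toy stand-in for an exact penalty: the constraint x >= 1 is
    replaced by a ReLU penalty on its violation (1 - x).
    """
    values = grid**2 + kappa * np.maximum(0.0, 1.0 - grid)
    return float(grid[np.argmin(values)])

# Grid with step 0.001 that contains (up to rounding) x = 0.5 and x = 1.0.
grid = np.linspace(-2.0, 3.0, 5001)

# kappa below the multiplier (2): the minimizer violates x >= 1.
x_small = penalized_minimizer(1.0, grid)

# kappa above the multiplier: the minimizer sits on the constraint boundary.
x_large = penalized_minimizer(3.0, grid)

print(x_small, x_large)
```

With a penalty factor below the multiplier the unconstrained minimizer lands near 0.5, which is infeasible; with a finite factor above it the minimizer coincides with the constrained solution, with no need to drive $\kappa$ to infinity.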

Abstract

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to make efficient policy updates while satisfying hard constraints. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O uses a simple yet effective penalty function to eliminate the cost constraints and replaces the trust-region constraint with the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when the objective is evaluated on sampled trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms in both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
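The combination described above, a penalty on the cost constraint plus PPO-style clipping, can be sketched as a single scalar loss. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `p3o_loss`, its argument names, the default values, and the `(1 - gamma)` scaling of the current constraint violation are all assumptions made here for illustration.

```python
import numpy as np

def p3o_loss(ratio, adv_r, adv_c, cost_return, cost_limit,
             kappa=20.0, clip_eps=0.2, gamma=0.99):
    """Hypothetical penalized clipped-surrogate loss (to be minimized).

    ratio:       pi_new(a|s) / pi_old(a|s) for each sample
    adv_r/adv_c: reward and cost advantage estimates under the old policy
    cost_return: estimated expected cost return of the old policy
    cost_limit:  constraint threshold d
    All defaults are illustrative assumptions.
    """
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Reward term: negated PPO clipped surrogate, so lower is better.
    loss_r = -np.mean(np.minimum(ratio * adv_r, clipped * adv_r))

    # Cost term: pessimistic (max) clipped surrogate of the cost change,
    # shifted by how far the old policy is from the limit (assumed scaling).
    loss_c = (np.mean(np.maximum(ratio * adv_c, clipped * adv_c))
              + (1.0 - gamma) * (cost_return - cost_limit))

    # ReLU penalty: inactive while the estimated constraint is satisfied.
    return float(loss_r + kappa * max(0.0, loss_c))

# Usage with made-up numbers: same samples, feasible vs. violating old policy.
ratio = np.ones(2)
adv_r = np.array([1.0, 3.0])
adv_c = np.zeros(2)
safe_loss = p3o_loss(ratio, adv_r, adv_c, cost_return=10.0, cost_limit=50.0)
unsafe_loss = p3o_loss(ratio, adv_r, adv_c, cost_return=150.0, cost_limit=50.0)
print(safe_loss, unsafe_loss)
```

When the estimated cost stays under the limit, the penalty vanishes and the loss reduces to the ordinary clipped reward surrogate; once the limit is exceeded, the penalty term dominates and pushes the update toward feasibility. In a real training loop the same expression would be written in an automatic-differentiation framework so that gradients flow through `ratio`.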

Paper Structure

This paper contains 24 sections, 10 theorems, 33 equations, 7 figures, 3 tables, 4 algorithms.

Introduction
Related Work
Primal-Dual solution.
Primal solution.
Preliminaries
Methodology
Experiments
Single-Constraint Scenario.
Sensitivity Analysis.
Multi-Constraint Scenario.
Multi-Agent Scenario.
Conclusion
Proof of Proposition 1.
Proof of Theorem 2.
Proof of Theorem 3.
...and 9 more sections

Key Result

Proposition 1

The new policy $\pi_{k+1}$ obtained from the current policy $\pi_k$ by solving problem (cppo1) yields a monotonic return improvement and hard constraint satisfaction.

Figures (7)

Figure 1: Experimental benchmarks. (a) Circle: The agent is rewarded for moving in a specified wide circle, but is constrained to stay within a safe region smaller than the radius of the circle. (b) Gather: The agent is rewarded for gathering green apples, but is constrained to avoid red bombs. (c) Navigation: The agent is rewarded for reaching the target area (green), but is constrained to avoid virtual hazards (light purple) and impassable pillars (dark purple). Note that the costs for hazards and pillars are calculated separately and have different upper limits. (d) Simple Spread: Agents are rewarded for reaching their corresponding destinations, but are constrained to avoid mutual collisions. The observation of each agent is not shared in the CMDP execution.
Figure 2: Average episode return and cost in the single-constraint scenario. The x-axis is the number of interactions with the emulator. The y-axis is the average reward/cost return. The solid line is the mean and the shaded area is the standard deviation. Each tested algorithm is run with five different seeds. The dashed line in the cost plot is the constraint threshold, which is 50 for Circle and 0.5 for Gather.
Figure 3: Performance of P3O for different $\kappa$ settings on AntCircle.
Figure 4: Performance of P3O for different cost limits $d$ on AntCircle.
Figure 5: Average episode return (left), cost1 (center, for hazards) and cost2 (right, for pillars) in the multi-constraint scenario. The dashed line in each cost plot is the constraint threshold, which is 25 for cost1 and 20 for cost2. "Hazard constrained" and "Pillar constrained" mean that only cost1 or only cost2, respectively, is included in the P3O loss function while the other is ignored.
...and 2 more figures

Theorems & Definitions (20)

Proposition 1
proof
Theorem 1
proof
Theorem 2
proof
Lemma 1
proof
Proposition
proof : Proof of Proposition 1
...and 10 more
