Penalized Proximal Policy Optimization for Safe Reinforcement Learning
Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, Dacheng Tao
TL;DR
P3O reframes constrained policy optimization as an unconstrained problem using an exact penalty function, and replaces the trust-region constraint with PPO's clipped surrogate objective, enabling efficient first-order optimization without Hessian inversions. The authors prove that the penalty is exact for a finite penalty factor (kappa) and bound the approximation error that arises when the objective is estimated from sampled trajectories. They also extend the approach to multiple constraints and to multi-agent settings. Empirical results on Circle, Gather, Navigation, and Simple Spread show that P3O achieves higher rewards with better constraint satisfaction than state-of-the-art safe-RL methods, and that the multi-agent variant, MAP3O, supports effective collaboration under safety requirements. The work offers a scalable, practical framework for safe reinforcement learning in complex CMDPs and DEC-POMDPs.
Abstract
Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to perform efficient policy updates while satisfying hard constraints. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which replaces the cumbersome constrained policy iteration with a single minimization of an equivalent unconstrained problem. Specifically, P3O uses a simple yet effective penalty function to eliminate the cost constraints and removes the trust-region constraint by means of the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when the objective is evaluated on sampled trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
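To make the idea of "penalty function plus clipped surrogate" concrete, the following is a minimal sketch of what such an unconstrained loss could look like. It is not taken from this section: the ReLU-style penalty max(0, .), the function name `p3o_loss_sketch`, the argument `constraint_gap` (an estimate of how far the current policy's expected cost exceeds the limit, negative when the policy is within the limit), and the default values of `kappa` and `eps` are all assumptions chosen for illustration.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def p3o_loss_sketch(
    ratios: Sequence[float],      # pi_new(a|s) / pi_old(a|s) per sample
    reward_adv: Sequence[float],  # reward advantage estimates
    cost_adv: Sequence[float],    # cost advantage estimates
    constraint_gap: float,        # hypothetical: current cost minus limit
    kappa: float = 20.0,          # penalty factor (assumed value)
    eps: float = 0.2,             # clipping range (assumed value)
) -> float:
    """Illustrative penalized, clipped loss to be minimized."""
    n = len(ratios)
    lo, hi = 1.0 - eps, 1.0 + eps

    # Reward part: negated PPO clipped surrogate (pessimistic min).
    reward_term = -sum(
        min(r * a, clip(r, lo, hi) * a) for r, a in zip(ratios, reward_adv)
    ) / n

    # Cost part: clipped surrogate of the cost change (pessimistic max),
    # shifted by how much the current policy already violates the limit.
    cost_term = sum(
        max(r * a, clip(r, lo, hi) * a) for r, a in zip(ratios, cost_adv)
    ) / n + constraint_gap

    # Assumed exact-penalty form: only a predicted violation is penalized.
    return reward_term + kappa * max(0.0, cost_term)


if __name__ == "__main__":
    # Within the cost limit: the penalty is inactive.
    print(p3o_loss_sketch([1.0, 1.0], [2.0, 0.0], [0.0, 0.0], -0.5))
    # Over the cost limit: the penalty dominates the loss.
    print(p3o_loss_sketch([1.0, 1.0], [2.0, 0.0], [0.0, 0.0], 0.5, kappa=10.0))
```

In this sketch the penalty term vanishes whenever the predicted cost stays within the limit, so the loss reduces to the ordinary clipped reward objective. Once a violation is predicted, the penalty grows linearly with it, scaled by `kappa`. Both parts use only first-order quantities, which is why no Hessian inversion is needed.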
