Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints
Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Chenghu Zhou
TL;DR
This paper addresses constrained reinforcement learning by proposing Exterior Penalty Policy Optimization (EPO), a first-order primal-space method augmented with a Penalty Metric Network (PMN) that adaptively scales penalties according to constraint violations. The PMN uses a two-stream cost critic architecture to generate a penalty metric that informs an unconstrained optimization of the objective via a surrogate penalty $P(\pi,\mu)$. The authors provide convergence guarantees, surrogate error bounds, and a practical smooth penalty formulation that integrates with PPO-style updates. Empirically, EPO with PMN outperforms state-of-the-art baselines on Safety Gymnasium and Safety MuJoCo, delivering faster constraint satisfaction and stable training while maintaining high policy performance. The work offers a scalable, theoretically grounded approach to CRL with robust applicability to complex, safety-critical tasks.
Abstract
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method, which imposes constraint penalties on the objective to transform the constrained problem into an unconstrained one, has recently been studied as an effective approach for handling constraints. However, it is challenging to choose penalties that efficiently balance policy performance and constraint satisfaction. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violation, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide bounds on the worst-case constraint violation and the approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments show that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
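To make the idea of an exterior penalty concrete, the following is a minimal sketch on a toy one-dimensional problem, not the EPO algorithm itself. It assumes a classic quadratic exterior penalty with a hand-picked increasing penalty schedule, whereas EPO generates its penalties adaptively with the PMN and uses its own smooth penalty function. The objective, the constraint limit, and all function names here are hypothetical and chosen only for illustration.

```python
# Toy problem (illustrative assumption): minimize f(x) = (x - 3)^2 subject to x <= limit.
# The constrained problem is replaced by the unconstrained penalized objective
#   f(x) + mu * max(0, x - limit)^2,
# which is solved with plain first-order gradient descent.

def objective_grad(x):
    # Gradient of the toy objective f(x) = (x - 3)^2.
    return 2.0 * (x - 3.0)


def penalty_grad(x, limit):
    # Gradient of the quadratic exterior penalty max(0, x - limit)^2.
    # It is zero inside the feasible region and grows with the violation.
    return 2.0 * max(0.0, x - limit)


def solve_penalized(x, mu, limit, steps=500):
    # Gradient descent on the penalized objective for a fixed penalty factor mu.
    # The step size is scaled by 1 / (1 + mu) so the update stays stable as mu grows.
    lr = 0.5 / (1.0 + mu)
    for _ in range(steps):
        x -= lr * (objective_grad(x) + mu * penalty_grad(x, limit))
    return x


def penalty_path(mus=(1.0, 10.0, 100.0, 1000.0), limit=1.0, x0=0.0):
    # Solve a sequence of penalized problems with increasing mu,
    # warm-starting each one from the previous solution.
    x, path = x0, []
    for mu in mus:
        x = solve_penalized(x, mu, limit)
        path.append((mu, x, max(0.0, x - limit)))  # (penalty factor, solution, violation)
    return path


if __name__ == "__main__":
    for mu, x, violation in penalty_path():
        print(f"mu={mu:7.1f}  x={x:.4f}  violation={violation:.4f}")
```

In this toy setting each penalized solution lies slightly outside the feasible region, and the violation shrinks as the penalty factor grows, which is the "exterior" behavior the method's name refers to. It also illustrates the trade-off raised in the abstract: a small fixed penalty leaves a large violation, while a large one forces small step sizes, which is one motivation for adapting the penalty to the degree of violation.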
