Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

Shiqing Gao; Jiaxin Ding; Luoyi Fu; Xinbing Wang; Chenghu Zhou

TL;DR

This paper addresses constrained reinforcement learning (CRL) by proposing Exterior Penalty Policy Optimization (EPO), a first-order primal method augmented with a Penalty Metric Network (PMN) that adapts the penalty to the degree of constraint violation. The PMN uses a two-stream cost critic architecture to generate a penalty metric, which is imposed on the objective so that the constrained problem becomes an unconstrained optimization of the penalty function $P(\pi,\mu)$. The authors provide a convergence guarantee, bounds on the worst-case constraint violation and approximation error of the surrogate function, and a practical smooth penalty function that integrates with PPO-style updates. Empirically, EPO outperforms the baselines on Safety Gymnasium and Safety MuJoCo in both policy performance and constraint satisfaction, with a stable training process, particularly on complex tasks.

Abstract

In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method, which imposes constraint penalties on the objective to transform the constrained problem into an unconstrained one, has recently been studied as an effective approach for handling constraints. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide bounds on the worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments show that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
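The idea of a smooth penalty optimized with a first-order method can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's penalty function, which this summary does not reproduce. It assumes a softplus term as one common smooth stand-in for $\max(0, \cdot)$, a hypothetical reward $-(x-2)^2$, a hypothetical cost $x$ with limit 1, and plain gradient ascent in place of a PPO-style update.

```python
import math

LIMIT = 1.0  # hypothetical cost limit d (assumption, not from the paper)


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def softplus(z, beta):
    """Smooth stand-in for max(0, z); approaches it as beta grows."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(beta * z))) / beta


def penalized_objective(x, mu, beta=10.0):
    """Toy reward -(x-2)^2 minus a smooth penalty on the cost x exceeding LIMIT."""
    return -(x - 2.0) ** 2 - softplus(x - LIMIT, beta) / mu


def gradient(x, mu, beta=10.0):
    """Analytic derivative of penalized_objective with respect to x."""
    return -2.0 * (x - 2.0) - sigmoid(beta * (x - LIMIT)) / mu


def gradient_ascent(x0, mu, lr=0.02, steps=2000):
    """Plain first-order ascent on the penalized objective."""
    x = x0
    for _ in range(steps):
        x += lr * gradient(x, mu)
    return x


# A huge mu makes the penalty weight 1/mu negligible: the reward optimum violates the limit.
x_unconstrained = gradient_ascent(3.0, mu=1e9)
# A small mu gives a strong penalty: the iterate settles at or below the limit.
x_penalized = gradient_ascent(3.0, mu=0.1)
print(f"no penalty: x = {x_unconstrained:.3f}, with penalty: x = {x_penalized:.3f}")
```

Because the penalty term is differentiable everywhere, the same gradient step applies on both sides of the constraint boundary, which is the property that makes a smooth penalty convenient for a first-order optimizer.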

Paper Structure (17 sections, 5 theorems, 18 equations, 5 figures, 1 algorithm)

Introduction
Related Work
Primal-Dual methods.
Primal methods.
Preliminaries
Methodology
Exterior Penalty Function with Penalty Metric Network
Exterior Penalty Policy Optimization and Convergence Analysis
Surrogate Penalty Function within Trust Region and Theoretical Bounds
Smooth Penalty Function in Practical Implementation
Experiment
Scenario Description.
Safety Gymnasium.
Safety MuJoCo.
Ablation Experiments.
...and 2 more sections

Key Result

Lemma 1

Suppose $\pi_t$ is the global maximizer of the penalty function $P(\pi, \mu_t)$ with penalty factor $\mu_t$, $\bar{\pi}$ is the optimal solution of the constrained problem (Eq. 6), and $F_R(\pi)$ is the objective function. Then the inequality holds: […] Decreasing the penalty factor, $\mu_{t+1} < \mu_t$, we get: […] The constraint $F_C(\pi_t)$ is monotonically non-increasing with the decreasing $\mu_t$ and same
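The monotonicity statement of Lemma 1 can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's setting: it uses a textbook quadratic exterior penalty $P(x, \mu) = -(x-2)^2 - \frac{1}{\mu}\max(0, x-1)^2$ with a hypothetical scalar "policy" $x$, reward $-(x-2)^2$, cost $x$, and limit 1. For this toy problem the maximizer has the closed form $x^*(\mu) = (2\mu+1)/(\mu+1)$, and the constrained optimum is $\bar{x} = 1$.

```python
LIMIT = 1.0  # hypothetical cost limit (assumption)


def reward(x):
    """Toy objective F_R: maximized at x = 2, which violates the limit."""
    return -(x - 2.0) ** 2


def cost(x):
    """Toy constraint function F_C: the constraint is cost(x) <= LIMIT."""
    return x


def penalty_objective(x, mu):
    """Quadratic exterior penalty: zero inside the feasible set, growing outside it."""
    return reward(x) - max(0.0, cost(x) - LIMIT) ** 2 / mu


def maximizer(mu):
    """Closed-form argmax of penalty_objective(., mu) for this toy problem."""
    return (2.0 * mu + 1.0) / (mu + 1.0)


mus = [1.0, 0.3, 0.1, 0.03, 0.01]  # decreasing penalty factors
trace = [(mu, maximizer(mu)) for mu in mus]
costs = [cost(x) for _, x in trace]
rewards = [reward(x) for _, x in trace]

for (mu, x), c, r in zip(trace, costs, rewards):
    print(f"mu={mu:<5} x*={x:.4f} cost={c:.4f} reward={r:.4f}")
```

As $\mu$ decreases, the cost of the maximizer falls toward the limit from outside the feasible set, while its reward stays at or above the constrained optimum $F_R(\bar{x}) = -1$ and approaches it.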

Figures (5)

Figure 1: Structure of the Penalty Metric Network in EPO, comprising two streams to capture the near-penalty and far-penalty. $V_N$ and $V_F$ denote the linear and quadratic cost value critics, respectively. The Weighting Layer integrates the different types of penalties and outputs the Penalty Metric, which is imposed on the objective to guide the policy update.
Figure 2: Comparison of EPO to baselines over 3 seeds on Safety Gymnasium. The x-axis is the training steps, the y-axis is the average return or constraint. The solid line is the mean and the shaded area is the standard deviation. The dashed line marks the constraint limit of 25.
Figure 3: Comparison of EPO to the baselines over 3 seeds on Safety MuJoCo. The dashed line marks the constraint limit of 25.
Figure 4: Ablation experiments in PointGoal1-v0.
Figure 5: EPO in PointGoal1-v0 with different cost limits.

Theorems & Definitions (5)

Lemma 1
Theorem 1
Corollary 1
Proposition 1 (Achiam et al., 2017)
Theorem 2
