Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning
Jiaming Zhang, Yujie Yang, Haoning Wang, Liping Zhang, Shengbo Eben Li
TL;DR
This work addresses safe reinforcement learning with infinitely many safety constraints that must hold uniformly over a compact index space $Y$. It introduces Exchange Policy Optimization (EPO), a framework that iteratively solves safe RL subproblems with finite constraint sets while adaptively managing the active constraint set: constraints that are $\eta$-infeasible are added, and constraints with zero Lagrange multipliers are deleted, which keeps the computation tractable while enforcing safety. The authors establish finite termination and a bound on global constraint violation, with a convergence guarantee to near-optimal performance as $\eta \to 0$. They also demonstrate practical gains on ship route planning and agricultural spraying tasks, where EPO yields safer policies than the prior SI-CPO baseline. Overall, EPO offers a general, scalable approach to SI-safe RL that combines RL subroutines with semi-infinite programming techniques to handle continuous safety constraints in real-world domains.
Abstract
Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, a setting known as semi-infinite safe RL (SI-safe RL). Such constraints typically arise when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints whose violations exceed a predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis shows that, under mild assumptions, policies trained via EPO achieve performance comparable to that of optimal solutions, with global constraint violations remaining strictly within a prescribed bound.
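The exchange rule described above can be illustrated on a toy semi-infinite program. The sketch below is not the EPO implementation: the scalar decision variable `x`, the threshold function `c`, the grid over $Y = [0, 1]$, and the closed-form `solve_subproblem` are all hypothetical stand-ins, with the closed-form solver taking the place of the safe RL subroutine. It only shows the loop structure: add an index whose violation exceeds the tolerance $\eta$, re-solve the finite subproblem, then delete constraints with zero multipliers.

```python
# Toy sketch of an exchange loop; NOT the paper's EPO implementation.
# Hypothetical problem: maximize a scalar x subject to x <= c(y) for all y in Y = [0, 1].


def c(y):
    """Hypothetical safety threshold at index y (assumption for illustration)."""
    return 1.0 + (y - 0.3) ** 2


def solve_subproblem(working_set, x_max=5.0):
    """Stand-in for the safe RL subroutine on a finite constraint set.

    Maximizes x subject to x <= x_max and x <= c(y) for every y in working_set.
    Returns the maximizer and one Lagrange multiplier per working-set index.
    For this linear toy problem, a valid multiplier choice is 1 on one binding
    constraint and 0 on all others.
    """
    if not working_set:
        return x_max, {}
    bounds = {y: c(y) for y in working_set}
    y_bind = min(bounds, key=bounds.get)
    if bounds[y_bind] >= x_max:
        # Only the box bound is active; every indexed constraint is slack.
        return x_max, {y: 0.0 for y in working_set}
    multipliers = {y: (1.0 if y == y_bind else 0.0) for y in working_set}
    return bounds[y_bind], multipliers


def exchange_loop(eta=1e-3, grid_size=1001, max_iters=50):
    """Run the add/delete exchange rule; the grid approximates the index space Y."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    working_set = []
    x, _ = solve_subproblem(working_set)
    for _ in range(max_iters):
        # Expansion: look for the most violated constraint over the grid.
        y_worst = max(grid, key=lambda y: x - c(y))
        if x - c(y_worst) <= eta:
            break  # every grid constraint is satisfied up to the tolerance eta
        working_set.append(y_worst)
        # Update on the enlarged finite constraint set.
        x, multipliers = solve_subproblem(working_set)
        # Deletion: drop constraints whose multiplier is zero after the update.
        working_set = [y for y in working_set if multipliers[y] > 0.0]
    return x, working_set


if __name__ == "__main__":
    x_final, active = exchange_loop()
    print("x =", x_final, "working set =", active)
```

In this toy setting the loop stops as soon as no grid index is violated by more than `eta`, and the deletion step keeps the working set at a single binding index. In the actual SI-safe RL setting, both the subproblem solver and the search for violated constraints are substantially harder, so the sketch should be read only as an outline of the control flow.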
