Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

Jiaming Zhang, Yujie Yang, Haoning Wang, Liping Zhang, Shengbo Eben Li

TL;DR

This work addresses safe reinforcement learning with infinitely many safety constraints, one for every point $y$ of a compact index space $Y$. It introduces Exchange Policy Optimization (EPO), a framework that iteratively solves safe RL subproblems over a finite working set of constraints and adapts that set in two ways: constraints whose violation exceeds a tolerance $\eta$ are added, and constraints with zero Lagrange multipliers are deleted. This keeps the computation tractable while controlling safety. The authors establish finite termination and a bound on global constraint violation, with a convergence guarantee to near-optimal performance as $\eta \to 0$; they also demonstrate practical gains on ship route planning and agricultural spraying tasks, where EPO yields safer policies than the prior SI-CPPO baseline. Overall, EPO offers a general, scalable approach for SI-safe RL that integrates RL subroutines with semi-infinite programming techniques to handle continuous safety constraints in real-world domains.

Abstract

Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, a setting known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, policies trained via EPO achieve performance comparable to optimal solutions while global constraint violations remain strictly within a prescribed bound.
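
The problem class described in the abstract can be written compactly as follows. This is only a sketch in assumed notation: $J_r(\pi)$ for the expected return is an assumption, while $J_{c_y}(\pi)$ and $d_y$ follow the symbols used in the figure captions; the paper's own formulation may differ in detail.

```latex
% SI-safe RL: one constraint for every index y in the compact set Y
\max_{\pi} \; J_r(\pi)
\quad \text{s.t.} \quad
J_{c_y}(\pi) \le d_y \quad \forall\, y \in Y .

% Bounded global violation with tolerance \eta > 0
\sup_{y \in Y} \bigl( J_{c_y}(\pi) - d_y \bigr)_{+} \le \eta .
```

Each EPO subproblem replaces $Y$ in the first line with a finite working set, and the second line is the kind of guarantee the abstract refers to as a prescribed bound on global constraint violation.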

Paper Structure

This paper contains 9 sections, 5 theorems, 46 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Suppose Assumption (A1) holds. Then the sequence of optimal values $\{J_k^\star\}$ is non-increasing, i.e., $J_{k+1}^\star \le J_k^\star$.

Figures (6)

  • Figure 1: Schematic diagram of agricultural aerial application problem
  • Figure 2: Exchange policy optimization framework. Our framework consists of four stages: violation detection, constraint expansion, subproblem solving, and constraint deletion. The rectangle denotes the domain of all constraint points, and the shaded region indicates the current active set. At each iteration, EPO first detects violated constraints, then expands the working set with the corresponding points, solves the resulting subproblem, and finally deletes points with zero Lagrange multipliers.
  • Figure 3: Solution trajectories and constraint violations of the policies trained by EPO (left) and SI-CPPO (right). The asterisk at the center marks the location of the ecological reserve. The green straight line represents the path obtained from training EPO, while the red line corresponds to SI-CPPO. The heatmap shows 5 times the constraint violation value, i.e., $5(J_{c_y}(\pi)-d_y)_{+}$. Higher values indicate more severe constraint violations.
  • Figure 4: Performance comparison between EPO and SI-CPPO over iterations in terms of (a) cumulative reward and (b) maximal constraint violation. The solid lines denote the average results over 10 random seeds, and the shaded areas indicate the 95% confidence interval.
  • Figure 5: Solutions and constraint violations of the policies trained by EPO (left) and SI-CPPO (right). The asterisks mark the locations of the planting centers. The green straight line represents the path obtained from training EPO, while the red line corresponds to SI-CPPO. The heatmap shows 5 times the constraint violation value, i.e., $5(J_{c_y}(\pi)-d_y)_{+}$. Higher values indicate more severe constraint violations.
  • ...and 1 more figure
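
The four stages named in the Figure 2 caption (violation detection, constraint expansion, subproblem solving, constraint deletion) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the "policy" is a single scalar, the RL subproblem is replaced by a closed-form toy optimization, and $Y = [0, 1]$ is approximated by a grid. All names (`exchange_loop`, `solve_toy`, `violation_toy`, `bound`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Policy = float  # toy stand-in: the "policy" is one scalar decision x


def exchange_loop(
    solve: Callable[[List[float]], Tuple[Policy, List[float]]],
    violation: Callable[[Policy, float], float],
    candidates: Sequence[float],
    eta: float,
    max_iters: int = 50,
) -> Tuple[Policy, List[float]]:
    """Exchange-style loop: detect, expand, solve, delete (illustrative only)."""
    working_set: List[float] = []
    policy, _ = solve(working_set)  # initial solve with no constraints
    for _ in range(max_iters):
        # 1) Violation detection on a finite grid that approximates Y.
        violated = [y for y in candidates if violation(policy, y) > eta]
        if not violated:
            return policy, working_set  # eta-feasible on the grid: stop
        # 2) Constraint expansion: add the violated index points.
        working_set = working_set + [y for y in violated if y not in working_set]
        # 3) Subproblem solving over the finite working set.
        policy, multipliers = solve(working_set)
        # 4) Constraint deletion: keep only points with a positive multiplier.
        working_set = [y for y, lam in zip(working_set, multipliers) if lam > 0.0]
    raise RuntimeError("no eta-feasible point found within max_iters")


def bound(y: float) -> float:
    """Assumed toy limit d_y; it is tightest (0.5) at y = 0.3."""
    return 0.5 + (y - 0.3) ** 2


def solve_toy(working_set: List[float]) -> Tuple[Policy, List[float]]:
    """Maximize x over [0, 1] subject to x <= bound(y) for y in working_set.

    Returns the maximizer and one multiplier per working-set point. For this
    one-dimensional problem only the tightest constraint gets a nonzero
    multiplier (unless the box limit x <= 1 is what binds).
    """
    if not working_set:
        return 1.0, []
    bounds = [bound(y) for y in working_set]
    x = min(1.0, min(bounds))
    multipliers = [0.0] * len(bounds)
    if x < 1.0:
        multipliers[bounds.index(x)] = 1.0
    return x, multipliers


def violation_toy(x: Policy, y: float) -> float:
    """Positive part of the constraint value, mirroring (J_{c_y} - d_y)_+."""
    return max(0.0, x - bound(y))


grid = [i / 100 for i in range(101)]  # finite approximation of Y = [0, 1]
eta = 0.01
policy, working_set = exchange_loop(solve_toy, violation_toy, grid, eta)
print(policy, working_set)
```

In this toy run the first pass adds almost every grid point, and the deletion step then shrinks the working set back to the single binding point, which is the behaviour the exchange rule is meant to provide. A real instantiation would replace `solve_toy` with a constrained RL solver that returns Lagrange multipliers and replace the grid scan with whatever violation-detection routine the method prescribes.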

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 4
  • Theorem 5