
Feasible Policy Iteration for Safe Reinforcement Learning

Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Wei Xu, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang

TL;DR

This work introduces Feasible Policy Iteration (FPI), which the authors present as the first dynamic-programming algorithm for safe reinforcement learning. FPI extends policy iteration into an actor-critic-scenery (ACS) loop, where the scenery is a feasibility function that represents the feasible region. FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function within that region, and geometric convergence to the optimal safe policy; the analysis rests on risky and feasible Bellman operators. The theory establishes contraction properties, restricted optimality, and region-superiority lemmas, while a neural implementation (FPI-SAC) carries the method to practical tasks. Empirically, FPI achieves zero constraint violations on low-dimensional tasks and matches or outperforms baselines on grid-world, classic control, and robot navigation tasks in both constraint adherence and reward.

Abstract

Safety is the priority concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL remains absent. In this paper, we propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. FPI alternates between policy evaluation, region identification, and policy improvement. This follows the actor-critic-scenery (ACS) framework, where scenery refers to a feasibility function that represents a feasible region. A region-wise update rule is developed for the policy improvement step, which maximizes the state-value function inside the feasible region and minimizes the feasibility function outside it. With this update rule, FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function, and geometric convergence to the optimal safe policy. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks and outperforms existing methods in constraint adherence and reward performance on high-dimensional tasks.
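
The three alternating steps described above can be illustrated on a toy problem. The following is a minimal tabular sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the six-state chain, the reward, the names `step`, `reward`, `cost`, `evaluate`, and `improve`, the choice of a discounted cost value as the feasibility function (with a state counted as feasible when that value is zero), and the restriction to feasibility-preserving actions inside the region.

```python
GAMMA = 0.9
N_STATES = 6          # hypothetical chain: states 0..5, state 0 is the hazard
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
TOL = 1e-9            # numerical threshold for "feasibility value is zero"

def step(x, a):
    """Deterministic transition; the hazard state 0 is absorbing."""
    if x == 0:
        return 0
    return x - 1 if a == 0 else min(x + 1, N_STATES - 1)

def reward(x, a):
    """Moving left is rewarded, which tempts the agent toward the hazard."""
    return 1.0 if (x != 0 and a == 0) else 0.0

def cost(x):
    """Constraint signal: 1 in the hazard state, 0 elsewhere."""
    return 1.0 if x == 0 else 0.0

def evaluate(policy, signal, sweeps=500):
    """Iterative evaluation of a discounted sum of `signal` under `policy`."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [signal(x, policy[x]) + GAMMA * values[step(x, policy[x])]
                  for x in range(N_STATES)]
    return values

def improve(policy, V, F, region):
    """Region-wise update: maximize value inside the region, minimize F outside."""
    new_policy = list(policy)
    for x in range(N_STATES):
        f_next = {a: cost(x) + GAMMA * F[step(x, a)] for a in ACTIONS}
        if x in region:
            # Assumption: only actions that keep the state feasible are allowed.
            safe = [a for a in ACTIONS if f_next[a] <= TOL]
            new_policy[x] = max(
                safe, key=lambda a: reward(x, a) + GAMMA * V[step(x, a)])
        else:
            new_policy[x] = min(ACTIONS, key=lambda a: f_next[a])
    return new_policy

def feasible_policy_iteration(policy, max_iters=50):
    regions = []
    for _ in range(max_iters):
        V = evaluate(policy, reward)                     # policy evaluation
        F = evaluate(policy, lambda x, a: cost(x))       # region identification
        region = {x for x in range(N_STATES) if F[x] <= TOL}
        regions.append(region)
        new_policy = improve(policy, V, F, region)       # policy improvement
        if new_policy == policy:
            break
        policy = new_policy
    return policy, V, regions

# Start from a policy that always moves left, so no state is feasible at first.
policy, V, regions = feasible_policy_iteration([0] * N_STATES)
print(policy)
print(regions)
```

In this toy run the recorded feasible regions never shrink from one iteration to the next, and the final policy moves right only in the state adjacent to the hazard, which mirrors on a small scale the monotonic expansion the abstract describes.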

Paper Structure

This paper contains 28 sections, 10 theorems, 97 equations, 8 figures, 3 tables.

Key Result

Proposition IV.1

For any policy $\pi$ such that $F^\pi=F^*$, we have $\mathrm{X}^\pi=\mathrm{X}^*$.

Figures (8)

  • Figure 1: Iteration framework and convergence mechanism of FPI. (a) PI adopts the AC framework. (b) The ACS framework for FPI features an additional RID step and a region-wise PIM. (c) The iteration of FPI converges to the optimal solution of safe RL.
  • Figure 2: Monotonicity and convergence of FPI. The feasible region monotonically expands and converges to the maximum feasible region. The state-value function monotonically increases inside the feasible region and converges to the optimal state-value function.
  • Figure 3: Four grid world tasks. A denotes the agent. G denotes the goal. I in (a) denotes the interruption. B in (a) denotes the button. X in (b) denotes the box. P in (c) denotes the punishment. The arrows in (d) denote the checkpoints.
  • Figure 4: Schematics of four classic control tasks.
  • Figure 5: Snapshots of four robot navigation tasks. The robots are in red. The blue circles represent the hazards. The green cylinders represent the goals. The yellow objects represent the boxes.
  • ...and 3 more figures

Theorems & Definitions (31)

  • Definition III.1: State feasibility
  • Definition III.2: Policy feasibility
  • Definition IV.1: Feasibility function
  • Example IV.1: Cost value function
  • Example IV.2: Constraint decay function
  • Definition IV.2: Optimal feasibility function
  • Proposition IV.1
  • Definition IV.3: Optimal state-value function
  • Definition IV.4: Optimal policy
  • Definition IV.5: Risky Bellman operator
  • ...and 21 more