Feasible Policy Iteration for Safe Reinforcement Learning

Yujie Yang; Zhilong Zheng; Shengbo Eben Li; Wei Xu; Jingjing Liu; Xianyuan Zhan; Ya-Qin Zhang

TL;DR

This work introduces Feasible Policy Iteration (FPI), the first dynamic-programming framework for safe reinforcement learning. FPI extends policy iteration with an actor-critic-scenery (ACS) loop that incorporates a feasibility function. It guarantees monotonic expansion of the feasible region and monotonic improvement of the value function within that region, with geometric convergence to the optimal safe policy through risky and feasible Bellman operators. A theoretical analysis establishes contraction properties, restricted optimality, and region-superiority lemmas, while a neural implementation (FPI-SAC) demonstrates practical performance on both low- and high-dimensional tasks, with zero constraint violations on the low-dimensional ones. Empirically, FPI outperforms or matches baselines across grid-world, classic control, and safety-robot navigation tasks, combining safety guarantees with competitive rewards in complex control settings.

Abstract

Safety is the priority concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL remains absent. In this paper, we propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. FPI alternates between policy evaluation, region identification, and policy improvement. This follows the actor-critic-scenery (ACS) framework, in which scenery refers to a feasibility function that represents a feasible region. A region-wise update rule is developed for the policy improvement step, which maximizes the state-value function inside the feasible region and minimizes the feasibility function outside it. With this update rule, FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function, and geometric convergence to the optimal safe policy. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks and outperforms existing methods in constraint adherence and reward performance on high-dimensional tasks.
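The alternation described in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: the six-state chain, the `step`, `violates`, and `reward` functions, the discount factor, and the particular feasibility function (a discounted indicator of a future violation, with the feasible region taken as the states where it equals zero) are all assumptions chosen for illustration. The sketch only shows the three steps in order: evaluate the value and feasibility functions, identify the feasible region, then improve the policy region-wise.

```python
# Tabular sketch of an evaluate / identify / improve loop.
# Everything here (environment, names, feasibility function) is hypothetical.
GAMMA = 0.9
N = 6                        # states 0..5 on a chain
HAZARD, GOAL = 0, 5
ACTIONS = (-1, +1)           # move left, move right


def step(x, a):
    """Deterministic toy dynamics (an assumption, not from the paper)."""
    if x in (HAZARD, GOAL):
        return x             # absorbing states
    if x == 1:
        return HAZARD        # slippery cell: every action slides into the hazard
    return x + a


def violates(x):
    return 1.0 if x == HAZARD else 0.0


def reward(x, a):
    return 1.0 if step(x, a) == GOAL else 0.0


def evaluate(policy, sweeps=200):
    """Policy evaluation: fixed-point sweeps for value V and feasibility F."""
    V, F = [0.0] * N, [0.0] * N
    for _ in range(sweeps):
        nxt = [step(x, policy[x]) for x in range(N)]
        V = [reward(x, policy[x]) + GAMMA * V[nxt[x]] for x in range(N)]
        # Assumed feasibility function: positive iff the policy ever violates.
        F = [violates(x) + (1.0 - violates(x)) * GAMMA * F[nxt[x]]
             for x in range(N)]
    return V, F


def improve(policy, V, F, region):
    """Region-wise improvement: maximize V inside, minimize F outside."""
    new_policy = list(policy)
    for x in range(N):
        if x in region:
            # Only consider actions that keep the next state feasible.
            safe = [a for a in ACTIONS if step(x, a) in region]
            new_policy[x] = max(
                safe, key=lambda a: reward(x, a) + GAMMA * V[step(x, a)])
        else:
            new_policy[x] = min(ACTIONS, key=lambda a: F[step(x, a)])
    return new_policy


def feasible_policy_iteration(max_iters=20):
    policy = [-1] * N        # start from an unsafe "always left" policy
    regions = []
    for _ in range(max_iters):
        V, F = evaluate(policy)
        region = {x for x in range(N) if F[x] == 0.0}   # region identification
        regions.append(region)
        new_policy = improve(policy, V, F, region)
        if new_policy == policy:
            break
        policy = new_policy
    return policy, V, regions


policy, V, regions = feasible_policy_iteration()
for k, region in enumerate(regions):
    print(f"iteration {k}: feasible region = {sorted(region)}")
```

In this toy problem the identified region grows from one iteration to the next, which mirrors the monotonic-expansion property claimed in the abstract; states 0 and 1 stay outside the region because no policy can avoid the hazard from them.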

Paper Structure (28 sections, 10 theorems, 97 equations, 8 figures, 3 tables)

Introduction
Related Work
Iterative unconstrained RL methods
Constrained policy optimization methods
Problem Formulation
Markov decision process with state constraints
Feasibility in safe reinforcement learning
Objective of safe reinforcement learning
Feasibility Function and Optimality Condition
Notation declaration
Feasibility function
Optimality condition for safe reinforcement learning
Feasible Policy Iteration
Policy evaluation
Region identification
...and 13 more sections

Key Result

Proposition IV.1

For any policy $\pi$ such that $F^\pi=F^*$, we have $\mathrm{X}^\pi=\mathrm{X}^*$.

Figures (8)

Figure 1: Iteration framework and convergence mechanism of FPI. (a) PI adopts the AC framework. (b) The ACS framework for FPI features an additional RID step and a region-wise PIM. (c) The iteration of FPI converges to the optimal solution of safe RL.
Figure 2: Monotonicity and convergence of FPI. The feasible region monotonically expands and converges to the maximum feasible region. The state-value function monotonically increases inside the feasible region and converges to the optimal state-value function.
Figure 3: Four grid world tasks. A denotes the agent. G denotes the goal. I in (a) denotes the interruption. B in (a) denotes the button. X in (b) denotes the box. P in (c) denotes the punishment. The arrows in (d) denote the checkpoints.
Figure 4: Schematics of four classic control tasks.
Figure 5: Snapshots of four robot navigation tasks. The robots are in red. The blue circles represent the hazards. The green cylinders represent the goals. The yellow objects represent the boxes.
...and 3 more figures

Theorems & Definitions (31)

Definition III.1: State feasibility
Definition III.2: Policy feasibility
Definition IV.1: Feasibility function
Example IV.1: Cost value function
Example IV.2: Constraint decay function
Definition IV.2: Optimal feasibility function
Proposition IV.1
Definition IV.3: Optimal state-value function
Definition IV.4: Optimal policy
Definition IV.5: Risky Bellman operator
...and 21 more
