Feasible Policy Iteration for Safe Reinforcement Learning
Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Wei Xu, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang
TL;DR
This work introduces Feasible Policy Iteration (FPI), the first dynamic-programming framework for safe reinforcement learning. FPI extends policy iteration with an actor-critic-scenery (ACS) loop that incorporates a feasibility function. It guarantees monotonic expansion of the feasible region and monotonic improvement of the value function within that region, with geometric convergence to the optimal safe policy through risky and feasible Bellman operators. The theoretical analysis establishes contraction properties, restricted optimality, and region-superiority lemmas. A neural implementation (FPI-SAC) achieves zero constraint violations on low-dimensional tasks and improves on existing methods in both constraint adherence and reward on high-dimensional tasks. Empirically, FPI outperforms or matches baselines across grid-world, classic control, and safety-robot navigation tasks.
Abstract
Safety is the primary concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL remains absent. In this paper, we propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. FPI alternates between policy evaluation, region identification, and policy improvement. This follows the actor-critic-scenery (ACS) framework, where the scenery is a feasibility function that represents a feasible region. A region-wise update rule is developed for the policy improvement step: it maximizes the state-value function inside the feasible region and minimizes the feasibility function outside it. With this update rule, FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function, and geometric convergence to the optimal safe policy. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks and outperforms existing methods in constraint adherence and reward performance on high-dimensional tasks.
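To make the three alternating steps concrete, the following is a minimal tabular sketch on a toy problem, not the paper's implementation. Everything in it is an illustrative assumption: the five-state chain with a hazard at state 0, the reward that tempts the agent toward the hazard, the feasibility function (taken here as a discounted indicator of future constraint violation, so that a value of zero marks a feasible state), the rule that in-region actions must keep the next state inside the region, and all function and variable names.

```python
import numpy as np

# Hypothetical toy problem: a 5-state chain whose state 0 is a hazard.
N_STATES, HAZARD = 5, 0
LEFT, RIGHT = 0, 1
GAMMA = 0.9


def step(s, a):
    """Deterministic dynamics; the hazard state is absorbing."""
    if s == HAZARD:
        return s
    return min(max(s + (1 if a == RIGHT else -1), 0), N_STATES - 1)


def reward(s, a):
    """Moving left (toward the hazard) is rewarded, which creates a conflict."""
    return 1.0 if (a == LEFT and s != HAZARD) else 0.0


def evaluate(policy, sweeps=500):
    """Policy evaluation: state value V and an assumed feasibility function F.

    F(s) = c(s) + (1 - c(s)) * GAMMA * F(s'), where c(s) = 1 on the hazard.
    Under this assumed definition, F(s) = 0 iff the policy never violates from s.
    """
    V, F = np.zeros(N_STATES), np.zeros(N_STATES)
    for _ in range(sweeps):
        V_new, F_new = np.zeros(N_STATES), np.zeros(N_STATES)
        for s in range(N_STATES):
            nxt = step(s, policy[s])
            c = 1.0 if s == HAZARD else 0.0
            V_new[s] = reward(s, policy[s]) + GAMMA * V[nxt]
            F_new[s] = c + (1.0 - c) * GAMMA * F[nxt]
        V, F = V_new, F_new
    return V, F


def feasible_policy_iteration(max_iters=50):
    policy = np.full(N_STATES, LEFT)  # deliberately unsafe initial policy
    regions = []
    for _ in range(max_iters):
        V, F = evaluate(policy)   # policy evaluation
        feasible = F < 1e-8       # region identification
        regions.append(feasible.copy())
        new_policy = policy.copy()
        for s in range(N_STATES):  # region-wise policy improvement
            nxt = [step(s, a) for a in (LEFT, RIGHT)]
            if feasible[s]:
                # Inside the region: maximize value over actions that stay inside.
                q = [reward(s, a) + GAMMA * V[nxt[a]] if feasible[nxt[a]] else -np.inf
                     for a in (LEFT, RIGHT)]
                new_policy[s] = int(np.argmax(q))
            else:
                # Outside the region: minimize the feasibility function.
                new_policy[s] = int(np.argmin([F[n] for n in nxt]))
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V, regions


policy, V, regions = feasible_policy_iteration()
print("policy:", policy.tolist())
print("feasible regions per iteration:", [r.tolist() for r in regions])
```

On this toy chain the feasible region starts empty under the all-left policy, grows to states 1 through 4 after one improvement step, and then stays fixed while the value inside it is improved. This mirrors, on a small scale, the monotonic region expansion and in-region value improvement described in the abstract. The sketch does not attempt to reproduce the convergence analysis or the neural variant.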
