Anytime Safe Reinforcement Learning
Pol Mestres, Arnau Marzabal, Jorge Cortés
TL;DR
This work addresses constrained reinforcement learning with anytime safety guarantees, which hold even if the algorithm is terminated prematurely. It introduces RL-SGF, an on-policy method that uses episodic estimates of the objective and constraint value functions, $V_0$ and $V_1$, and of their gradients to update the policy by solving a convex quadratically constrained quadratic program, so that iterates remain feasible with high probability at each step. Theoretical results establish finite-sample safety guarantees and convergence to a neighborhood of a KKT point, with the neighborhood shrinking as more episodes are used for estimation. Empirical results on a 2D navigation task show that RL-SGF achieves strong safety performance while delivering returns competitive with primal-dual methods and CPO.
Abstract
This paper considers constrained reinforcement learning with anytime guarantees, meaning that the algorithm returns a safe policy regardless of when it is terminated. Drawing inspiration from anytime constrained optimization, we introduce Reinforcement Learning-based Safe Gradient Flow (RL-SGF), an on-policy algorithm that estimates, for the current policy, the value functions associated with the objective and the safety constraints together with their gradients, and updates the policy parameters by solving a convex quadratically constrained quadratic program. We show that if the estimates are computed with a sufficiently large number of episodes (for which we provide an explicit bound), safe policies are updated to safe policies with probability higher than a prescribed tolerance. We also show that the iterates asymptotically converge to a neighborhood of a KKT point, whose size can be made arbitrarily small by refining the estimates of the value functions and their gradients. We illustrate the performance of RL-SGF in a navigation example.
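To make the idea of a safety-filtered policy update concrete, the following is a minimal sketch of a safe-gradient-flow-style step, not the QCQP solved by RL-SGF. It assumes a single constraint, the hypothetical convention that a policy with parameters $\theta$ is safe when $V_1(\theta) \ge 0$, and exact rather than estimated quantities. Under those assumptions the update direction is the point closest to the objective gradient that satisfies the linearized condition $\nabla V_1^\top \xi \ge -\alpha V_1$, which has a closed form. The function name `safe_gradient_step` and the parameter `alpha` are illustrative and do not come from the paper.

```python
import numpy as np


def safe_gradient_step(grad_v0, v1, grad_v1, alpha=1.0):
    """Project grad_v0 onto the half-space {xi : grad_v1 @ xi >= -alpha * v1}.

    Assumed convention (not from the paper): the policy is safe when v1 >= 0,
    and the objective V_0 is to be maximized.
    """
    grad_v0 = np.asarray(grad_v0, dtype=float)
    grad_v1 = np.asarray(grad_v1, dtype=float)

    norm_sq = float(grad_v1 @ grad_v1)
    if norm_sq == 0.0:
        # The linearized constraint does not depend on xi, so no correction
        # is possible; fall back to the plain gradient direction.
        return grad_v0

    bound = -alpha * v1
    # Multiplier of the single linear constraint; zero when the plain
    # gradient direction already satisfies it.
    lam = max(0.0, (bound - float(grad_v1 @ grad_v0)) / norm_sq)
    return grad_v0 + lam * grad_v1


# Example: the objective gradient points against the constraint gradient,
# so the step is shortened until the linearized constraint holds.
xi = safe_gradient_step(grad_v0=[1.0, 0.0], v1=0.5, grad_v1=[-1.0, 0.0])
```

A parameter update would then take the form `theta = theta + step_size * xi` for some hypothetical `step_size`. When the value functions and gradients are replaced by episodic estimates, this projection alone offers no safety guarantee, which is the gap the paper's QCQP formulation and episode bound are meant to address.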
