Anytime Safe Reinforcement Learning

Pol Mestres, Arnau Marzabal, Jorge Cortés

TL;DR

This work addresses constrained reinforcement learning with safety guarantees that hold anytime, even if the algorithm is terminated prematurely. It introduces RL-SGF, an on-policy method that uses episodic estimates of $V_0$, $V_1$ and their gradients to update policies via a convex quadratically constrained quadratic program, ensuring feasibility with high probability at each step. Theoretical results establish finite-sample safety guarantees and convergence to a neighborhood of a KKT point, with the neighborhood shrinking as more episodes are used for estimation. Empirical results on a 2D navigation task show RL-SGF achieves strong safety performance while delivering competitive returns compared to primal-dual methods and CPO.

Abstract

This paper considers the problem of solving constrained reinforcement learning problems with anytime guarantees, meaning that the algorithmic solution returns a safe policy regardless of when it is terminated. Drawing inspiration from anytime constrained optimization, we introduce Reinforcement Learning-based Safe Gradient Flow (RL-SGF), an on-policy algorithm which employs estimates of the value functions and their respective gradients associated with the objective and safety constraints for the current policy, and updates the policy parameters by solving a convex quadratically constrained quadratic program. We show that if the estimates are computed with a sufficiently large number of episodes (for which we provide an explicit bound), safe policies are updated to safe policies with a probability higher than a prescribed tolerance. We also show that iterates asymptotically converge to a neighborhood of a KKT point, whose size can be arbitrarily reduced by refining the estimates of the value function and their gradients. We illustrate the performance of RL-SGF in a navigation example.
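The update described above can be illustrated with a simplified sketch. It is not the paper's convex QCQP, whose exact form is not reproduced in this summary. It uses the single-constraint quadratic program from the safe gradient flow literature, $\min_\xi \tfrac{1}{2}\|\xi + \nabla V_0\|^2$ subject to $\nabla V_1^\top \xi \le -\alpha V_1$, which has a closed-form solution, and takes a step $\theta \leftarrow \theta + h\,\xi$. The function names, the toy objective, and the use of exact gradients instead of episodic estimates are all assumptions made for illustration.

```python
import numpy as np

def sgf_direction(grad_v0, v1, grad_v1, alpha):
    """Closed-form solution of
        min_xi 0.5 * ||xi + grad_v0||^2   s.t.   grad_v1 @ xi <= -alpha * v1
    (a single-constraint QP stand-in; NOT the paper's QCQP)."""
    g0 = np.asarray(grad_v0, dtype=float)
    g1 = np.asarray(grad_v1, dtype=float)
    denom = g1 @ g1
    if denom == 0.0:
        # Constraint gradient vanishes: fall back to plain gradient descent.
        return -g0
    # KKT multiplier: zero when the unconstrained direction -g0 is already feasible.
    lam = max(0.0, (alpha * v1 - g1 @ g0) / denom)
    return -g0 - lam * g1

def sgf_step(theta, grad_v0, v1, grad_v1, alpha=1.0, h=0.1):
    """One forward-Euler step theta + h * xi along the safe direction."""
    theta = np.asarray(theta, dtype=float)
    return theta + h * sgf_direction(grad_v0, v1, grad_v1, alpha)

# Hypothetical toy problem: minimize V0 = ||theta - target||^2
# subject to V1 = theta[0] - 1 <= 0, starting from a safe point.
target = np.array([2.0, 0.0])
theta = np.array([0.0, 0.0])
v1_history = []
for _ in range(200):
    grad_v0 = 2.0 * (theta - target)   # exact gradient of V0
    v1 = theta[0] - 1.0                # constraint value (safe when <= 0)
    grad_v1 = np.array([1.0, 0.0])     # exact gradient of V1
    v1_history.append(v1)
    theta = sgf_step(theta, grad_v0, v1, grad_v1, alpha=1.0, h=0.1)
```

In this toy case the constraint is affine, so each step multiplies $V_1$ by $(1 - h\alpha)$ and every iterate stays safe while $\theta$ approaches the constrained minimizer $(1, 0)$. With estimated values and gradients, as in RL-SGF, safety can only be expected with high probability.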

Paper Structure

This paper contains 8 sections, 7 theorems, 25 equations, 3 figures, 1 table, 1 algorithm.

Key Result

lemma 1

(Constraint satisfaction, KKT points, and convergence): Let $V_0$ and $V_1$ be Lipschitz on their domain of definition with Lipschitz constants $L_0$ and $L_1$ respectively, and assume $V_0$ is lower bounded. Let $\alpha>0$ and $h\in(0,\min\{ \frac{1}{\alpha},\frac{1}{L_0},\frac{1}{L_1} \} )$. For $

Figures (3)

  • Figure 1: Evolution of the policy $\pi_{\theta}$ generated by RL-SGF at different stages of the learning process for the single-integrator dynamics. The arrows indicate the mean $\mu_{\theta}$, the target state $x^*=(8,8)$ is marked in green, and four trajectories starting at $(1,1),(5,5),(9,1)$ and $(1,9)$ are plotted in blue. Obstacles are depicted in red. From left to right: initial policy and after 200, 500, and 1500 iterations.
  • Figure 2: Comparison between RL-SGF, primal-dual approaches (PD), and Constrained Policy Optimization (CPO). Evolution of the average return $\widehat{V_0}(\theta)$ and safety measure $\widehat{V_1}(\theta)$ for single-integrator (left) and differential-drive (right) dynamics. The initial dual variable is denoted $\lambda_0$. Shaded areas represent 95% confidence intervals over 5 runs. The unsafe region ($\widehat{V_1}(\theta)>0$) is in gray.
  • Figure 3: Illustration of the performance of RL-SGF as a function of the number of episodes $N$ used in the estimates of the value functions and their gradients. The left plot shows the evolution of the average return $\widehat{V_0}(\theta)$ and the right plot shows the safety measure $\widehat{V_1}(\theta)$ during training. Shaded areas are 95% confidence intervals over 5 runs.
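
The role of the number of episodes $N$ in Figure 3 can be illustrated with a generic Monte Carlo sketch: a value is estimated as the average of $N$ sampled episode returns, and the standard error of that average shrinks roughly like $1/\sqrt{N}$. This is not the paper's estimator or its explicit episode bound. The `toy_return` function is a hypothetical stand-in for one rollout, with an assumed true value of $-0.2$ and unit-variance noise.

```python
import numpy as np

def estimate_value(sample_return, n_episodes, rng):
    """Average n_episodes sampled returns; also report the standard error."""
    returns = np.array([sample_return(rng) for _ in range(n_episodes)])
    mean = returns.mean()
    stderr = returns.std(ddof=1) / np.sqrt(n_episodes)
    return mean, stderr

def toy_return(rng):
    # Hypothetical rollout: true value -0.2 plus unit-variance Gaussian noise.
    return -0.2 + rng.normal(scale=1.0)

rng = np.random.default_rng(0)
# Map each episode count N to its (estimate, standard error) pair.
results = {n: estimate_value(toy_return, n, rng) for n in (10, 100, 1000, 10000)}
for n, (mean, stderr) in results.items():
    print(f"N={n:5d}  estimate={mean:+.3f}  stderr={stderr:.3f}")
```

A larger $N$ gives a tighter estimate at the cost of more rollouts per policy update, which is the trade-off the figure explores.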

Theorems & Definitions (14)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • proof
  • remark 1
  • proposition 1
  • proof
  • corollary 1
  • ...and 4 more