Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems

Wesley A. Suttle, Vipul K. Sharma, Krishna C. Kosaraju, S. Sivaranjani, Ji Liu, Vijay Gupta, Brian M. Sadler

TL;DR

We develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment.

Abstract

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.
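
To make the contrast between the two-stage and single-stage ideas concrete, here is a minimal sketch, not the authors' implementation. It assumes a one-dimensional action and a safe action interval `[a_lo, a_hi]` that is already known (in the paper's setting such a set would be prescribed by, for example, a control barrier function; here it is simply passed in). The function names and parameters are hypothetical. The first function projects an arbitrary action onto the interval, as a safety filter would; the second samples directly from a Beta distribution rescaled to the interval, so every sampled action is safe by construction.

```python
import random


def project_action(action, a_lo, a_hi):
    """Two-stage style filter: clip a possibly unsafe action into [a_lo, a_hi]."""
    return min(max(action, a_lo), a_hi)


def sample_safe_action(alpha, beta, a_lo, a_hi, rng=random):
    """Single-stage style sampling: draw from Beta(alpha, beta) rescaled to [a_lo, a_hi].

    The interval is an assumed input; how it is derived from the system
    dynamics and the safety constraint is outside this sketch.
    """
    if a_lo > a_hi:
        raise ValueError("safe interval is empty")
    u = rng.betavariate(alpha, beta)  # u lies in [0, 1]
    return a_lo + (a_hi - a_lo) * u


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_safe_action(2.0, 5.0, -0.3, 0.7, rng) for _ in range(5)]
    print(samples)                         # every value lies in [-0.3, 0.7]
    print(project_action(2.0, -0.3, 0.7))  # clipped to the upper bound
```

With projection, the executed action no longer follows the distribution the learner assumes it sampled from. Sampling inside the safe set keeps the executed action and the policy's own distribution in agreement.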

Paper Structure (32 sections, 8 theorems, 40 equations, 6 figures, 2 algorithms)

Key Result

Proposition 1

Under the volume-preservation, positive-probability, and positive-volume-reachability assumptions, for given $\theta$ and any subset $\mathcal{B} \subset \mathcal{S}$ satisfying $\mu(\mathcal{B}) > 0$, the Markov chain induced by $\pi^C_{\theta}$ on $\mathcal{S}$ enters $\mathcal{B}$ with

Figures (6)

  • Figure 1: Safety and convergence of CBF-constrained Beta policies: the agent was trained with PPO on the quadrotor navigation problem with an obstacle. Safety was maintained and the goal was eventually reached.
  • Figure 3: CBF-constrained Beta vs. unconstrained Gaussian on inverted pendulum environment with safe set $\mathcal{S}_{0.5} = \{ \theta \ | \ -0.5 \leq \theta \leq 0.5 \}$. "Safety Rate" denotes percentage of time spent in safe set. Curves present mean and 95% confidence intervals over 5 replications.
  • Figure 4: PPO hyperparameters for the inverted pendulum experiments.
  • Figure 5: Comparison of safety-constrained Beta policy and unconstrained Gaussian policy on the inverted pendulum environment with constraint set $\mathcal{S}_{1.0} = \{ \theta \ | \ -1.0 \leq \theta \leq 1.0 \}$. The top figure presents learning curves, while the bottom figure presents the "safety rate", i.e., the percentage of time spent in $\mathcal{S}_{1.0}$ over the course of the episode. The curves represent means and 95% confidence intervals over five independent replications.
  • Figure 6: PPO hyperparameters for the quadcopter experiments.
  • ...and 1 more figure
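
The "safety rate" in the pendulum captions is described as the percentage of time spent in the safe set. A minimal sketch of that metric follows, assuming an episode is recorded as a list of pendulum angles and that each time step counts equally; the function name and the sample trajectory are hypothetical.

```python
def safety_rate(angles, bound=0.5):
    """Percentage of time steps whose angle lies in [-bound, bound]."""
    if not angles:
        raise ValueError("trajectory is empty")
    inside = sum(1 for theta in angles if -bound <= theta <= bound)
    return 100.0 * inside / len(angles)


if __name__ == "__main__":
    # Hypothetical four-step trajectory; three of the four angles are in the safe set.
    print(safety_rate([0.0, 0.4, 0.6, -0.5]))
```

Setting `bound=1.0` gives the same metric for the larger constraint set $\mathcal{S}_{1.0}$.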

Theorems & Definitions (10)

  • Definition 1
  • Proposition 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Remark 1
  • Theorem 2
  • Theorem 3
  • Corollary 2