Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems

Wesley A. Suttle; Vipul K. Sharma; Krishna C. Kosaraju; S. Sivaranjani; Ji Liu; Vijay Gupta; Brian M. Sadler

TL;DR

This paper develops a single-stage, sampling-based approach to hard constraint satisfaction. The approach learns RL controllers that enjoy classical convergence guarantees while satisfying hard safety constraints throughout training and deployment.

Abstract

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.
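The contrast between projecting an unsafe action and sampling directly from a safe action set can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a toy one-dimensional integrator $x_{t+1} = x_t + \Delta t \, u_t$, a safe interval $|x| \leq x_{\max}$, and a discrete-time barrier condition $h(x_{t+1}) \geq (1 - \alpha) h(x_t)$ for the two barriers $h_1(x) = x_{\max} - x$ and $h_2(x) = x + x_{\max}$. All function names, parameter values, and the fixed Beta shape parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def safe_action_interval(x, x_max=0.5, u_max=1.0, dt=0.05, alpha=0.5):
    """Return the interval of actions satisfying both barrier conditions.

    Assumed toy model: x' = x + dt * u with barriers h1 = x_max - x and
    h2 = x + x_max, each required to satisfy h(x') >= (1 - alpha) * h(x).
    """
    hi = min(u_max, alpha * (x_max - x) / dt)    # from the h1 condition
    lo = max(-u_max, -alpha * (x + x_max) / dt)  # from the h2 condition
    return lo, hi


def sample_safe_action(rng, x, a, b, **kw):
    """Draw z ~ Beta(a, b) on [0, 1] and rescale it onto the safe interval.

    Every sample is safe by construction, so no projection step is needed.
    """
    lo, hi = safe_action_interval(x, **kw)
    z = rng.beta(a, b)
    return lo + (hi - lo) * z


def rollout(seed=0, steps=500, x0=0.0, dt=0.05):
    """Simulate the toy system under a fixed (untrained) Beta policy."""
    rng = np.random.default_rng(seed)
    x = x0
    xs = [x]
    for _ in range(steps):
        u = sample_safe_action(rng, x, a=2.0, b=2.0, dt=dt)
        x = x + dt * u
        xs.append(x)
    return np.array(xs)


if __name__ == "__main__":
    xs = rollout()
    print("max |x| along the rollout:", np.abs(xs).max())
```

In a learned policy, the shape parameters `a` and `b` would be outputs of a parameterized function of the state. Because the density is supported only on the safe interval, the policy itself remains the object being optimized, which is the property the abstract attributes to the single-stage approach.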

Paper Structure (32 sections, 8 theorems, 40 equations, 6 figures, 2 algorithms)

INTRODUCTION
Related Work
PROBLEM SETTING
THEORETICAL RESULTS
Discounted Return is Well-defined
Policy Gradients
Algorithm
Convergence
EXPERIMENTAL RESULTS
CBF-Constrained Beta Policies
Implementation
Case study 1: Quadcopter Navigation
Case study 2: Inverted pendulum
CONCLUSION
Proofs
...and 17 more sections

Key Result

Proposition 1

Under the volume-preservation, positive-probability, and positive-volume-reachability assumptions, for a given $\theta$ and any subset $\mathcal{B} \subset \mathcal{S}$ satisfying $\mu(\mathcal{B}) > 0$, the Markov chain induced by $\pi^C_{\theta}$ on $\mathcal{S}$ enters $\mathcal{B}$ with

Figures (6)

Figure 1: Safety and convergence of CBF-constrained Beta policies. The agent was trained with PPO on the quadrotor navigation problem with an obstacle. Safety was maintained and the goal was eventually reached.
Figure 3: CBF-constrained Beta vs. unconstrained Gaussian on the inverted pendulum environment with safe set $\mathcal{S}_{0.5} = \{ \theta \ | \ -0.5 \leq \theta \leq 0.5 \}$. "Safety Rate" denotes the percentage of time spent in the safe set. Curves present means and 95% confidence intervals over 5 replications.
Figure 4: PPO hyperparameters for the inverted pendulum experiments.
Figure 5: Comparison of safety-constrained Beta policy and unconstrained Gaussian policy on the inverted pendulum environment with constraint set $\mathcal{S}_{1.0} = \{ \theta \ | \ -1.0 \leq \theta \leq 1.0 \}$. The top figure presents learning curves, while the bottom figure presents the "safety rate", i.e., the percentage of time spent in $\mathcal{S}_{1.0}$ over the course of the episode. The curves represent means and 95% confidence intervals over five independent replications.
Figure 6: PPO hyperparameters for the quadcopter experiments.
...and 1 more figure

Theorems & Definitions (10)

Definition 1
Proposition 1
Corollary 1
Lemma 1
Lemma 2
Theorem 1
Remark 1
Theorem 2
Theorem 3
Corollary 2
