Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

Lunet Yifru; Ali Baheri


TL;DR

The paper proposes an approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. In the reported case studies, it accurately identifies the environmental safety constraints and learns safe policies that adhere to them.

Abstract

Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. This reliance on predefined constraints is limiting in dynamic and unpredictable real-world settings, where such constraints may be unavailable or insufficiently adaptable. To bridge this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Starting from a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that integrates constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization of the parameters of the given pSTL safety specification. Through experiments in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently obtaining safe RL policies with high returns. Furthermore, our findings indicate successful learning of the STL safety constraint parameters, which conform closely to the true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario with complete prior knowledge of the safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to them.
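As a rough illustration of the Lagrangian relaxation mentioned in the abstract, the sketch below shows a generic projected dual-ascent update of a multiplier that penalizes constraint cost above a limit. It is not taken from the paper: the function names, the learning rate, and the cost limit are all hypothetical, and the TD3-Lagrangian variant used by the authors may differ in its details.

```python
def update_multiplier(lam, avg_cost, cost_limit, lr=0.05):
    """One projected dual-ascent step (generic sketch, hypothetical names).

    The multiplier grows while the average constraint cost exceeds the
    limit, shrinks otherwise, and is clipped so it never goes negative.
    """
    return max(0.0, lam + lr * (avg_cost - cost_limit))


def penalized_objective(reward_value, cost_value, lam):
    """Scalar objective a Lagrangian actor would maximize: reward minus weighted cost."""
    return reward_value - lam * cost_value
```

In this kind of scheme the policy is trained on the penalized objective while the multiplier is updated from observed constraint costs, so a persistently unsafe policy faces a steadily growing penalty.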


Paper Structure (17 sections, 27 equations, 7 figures, 5 tables, 1 algorithm)


Introduction
Related Work
Preliminaries
Signal Temporal Logic
Reinforcement Learning
Bayesian Optimization
Problem Statement and Formulation
Methodology
STL Constraint Parameter Learning
Policy Learning
Human Feedback Mechanism
Case Studies
Case Study 1: Safe Navigation - Circle
Case Study 2: Safe Navigation - Goal
Case Study 3: Safe Velocity - Half Cheetah
...and 2 more sections

Figures (7)

Figure 1: Schematic representation of the integrated framework for concurrently learning STL constraint parameters and optimal policies. The framework applies BO for STL parameter mining and TD3-Lagrangian for policy learning, and incorporates a human expert who labels rollout traces used to refine the learned constraint parameters and policy. Once the percentage of safe traces in a rollout dataset, $\alpha$, exceeds the threshold value $\delta$, convergence is achieved, and the final policy and STL constraint are extracted.
Figure 2: Circular navigation environment with two boundaries in the $x$ direction (in yellow) and the safe navigation area (in green).
Figure 3: Goal navigation environment with eight hazards (in blue), and one goal location (in green).
Figure 4: Safe velocity test environment with the half cheetah agent.
Figure 5: BO learning curves for parameter learning of the pSTL specifications provided in the case studies. The three panels depict, respectively, the learning curve for optimizing two parameters, the learning curve for optimizing 16 parameters, and the learning curve for optimizing two parameters. The minimization metric is the balanced misclassification rate (MCR) of the STL at sequentially generated candidate points.
...and 2 more figures
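The balanced misclassification rate named in the Figure 5 caption can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the pSTL template ("always lo < x < hi", loosely modeled on the two x-direction boundaries of Figure 2), the function names, and the toy traces are hypothetical, and the paper's exact metric definition is not shown in this excerpt. The sketch uses the common definition of balanced MCR as the mean of the false-negative and false-positive rates.

```python
def robustness(trace, lo, hi):
    """Robustness of 'always (lo < x < hi)': the worst-case margin over the trace."""
    return min(min(x - lo, hi - x) for x in trace)


def balanced_mcr(traces, labels, lo, hi):
    """Mean of per-class error rates for candidate parameters (lo, hi).

    labels[i] is True when trace i was labeled safe. A trace is predicted
    safe when its robustness is strictly positive.
    """
    safe_total = sum(1 for is_safe in labels if is_safe)
    unsafe_total = len(labels) - safe_total
    missed_safe = 0    # safe traces the candidate calls unsafe
    missed_unsafe = 0  # unsafe traces the candidate calls safe
    for trace, is_safe in zip(traces, labels):
        predicted_safe = robustness(trace, lo, hi) > 0
        if is_safe and not predicted_safe:
            missed_safe += 1
        elif not is_safe and predicted_safe:
            missed_unsafe += 1
    fnr = missed_safe / safe_total if safe_total else 0.0
    fpr = missed_unsafe / unsafe_total if unsafe_total else 0.0
    return 0.5 * (fnr + fpr)


# Toy labeled traces of x positions (made up for illustration only).
traces = [[0.0, 0.5, 0.9], [0.2, -0.3, 0.1], [0.0, 1.4, 0.3], [-1.6, 0.0, 0.2]]
labels = [True, True, False, False]
```

On this toy data, the candidate (lo, hi) = (-1.0, 1.0) classifies every trace correctly, while an overly loose candidate such as (-2.0, 2.0) accepts both unsafe traces. A Bayesian optimizer would treat `balanced_mcr` as the black-box objective to minimize over the candidate parameters.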
