Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

George Papadopoulos; George A. Vouros

TL;DR

The proposed Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared with state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.

Abstract

Given a set of trajectories that demonstrate the safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing conservatism against significantly increasing the likelihood of high-reward trajectories that may contain unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. To do so, we formulate the "promise" of individual state-action pairs in terms of $Q$ values, which depend on task-specific rewards as well as on an assessment of the states' safety, thus mixing expectations of reward and safety. This yields a safe Q-learning perspective on the inverse learning problem under constraints. The resulting Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared with state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.

Paper Structure (17 sections, 44 equations, 5 figures, 28 tables, 1 algorithm)

Introduction
Preliminaries and Motivation
Problem specification
Safe Q-Learning
Objective function
The SafeQIL algorithm
Experiments
Experimental settings
Experimental results
Ablation study
Related work
Conclusions
Quantitative Trade-off Analysis
More learning curves
Extended Ablation Study on SafeQIL Components
...and 2 more sections

Figures (5)

Figure 1: Snapshots of: (top-left) SafetyCarPush2-v0, (top-right) SafetyPointCircle2-v0, (bottom-left) SafetyPointGoal1-v0, (bottom-right) SafetyCarButton1-v0.
Figure 2: Learning curves for SafetyPointGoal1-v0.
Figure 3: Learning curves for the SafetyPointCircle2-v0 benchmark.
Figure 4: Learning curves for the SafetyCarButton1-v0 benchmark.
Figure 5: Learning curves for the SafetyCarPush2-v0 benchmark.
