Safe Reinforcement Learning with Minimal Supervision

Alexander Quessy, Thomas Richardson, Sebastian East

TL;DR

This research shows that agents need sufficient demonstrations to learn optimal safe-RL policies online, and proposes optimistic forgetting, a novel online safe-RL approach that is practical for scenarios with limited data.

Abstract

Reinforcement learning (RL) in the real world necessitates the development of procedures that enable agents to explore without causing harm to themselves or others. The most successful solutions to the problem of safe RL leverage offline data to learn a safe-set, enabling safe online exploration. However, this approach to safe-learning is often constrained by the demonstrations that are available for learning. In this paper we investigate the influence of the quantity and quality of data used to train the initial safe learning problem offline on the ability to learn safe-RL policies online. Specifically, we focus on tasks with spatially extended goal states where we have few or no demonstrations available. Classically this problem is addressed either by using hand-designed controllers to generate data or by collecting user-generated demonstrations. However, these methods are often expensive and do not scale to more complex tasks and environments. To address this limitation we propose an unsupervised RL-based offline data collection procedure, to learn complex and scalable policies without the need for hand-designed controllers or user demonstrations. Our research demonstrates the significance of providing sufficient demonstrations for agents to learn optimal safe-RL policies online, and as a result, we propose optimistic forgetting, a novel online safe-RL approach that is practical for scenarios with limited data. Further, our unsupervised data collection approach highlights the need to balance diversity and optimality for safe online exploration.

Paper Structure (27 sections, 28 equations, 16 figures, 4 tables, 4 algorithms)

Figures (16)

  • Figure 1: Episodic return from the 3 navigation environments using the LMPC procedure outlined in LS$^{3}$. SPB requires a minimum of 125 demonstrations to learn a goal-reaching policy and SVB is unable to learn a goal-reaching policy at all.
  • Figure 2: Heatmap of the Safe-Set $f_{\mathcal{S}}(s)$ for SPB. Left: initial offline training. Top right: after 50 updates with optimistic forgetting. Bottom right: after 50 updates without optimistic forgetting.
  • Figure 3: Episodic return from the 3 navigation environments, trained using the procedure outlined in algorithm \ref{alg:safe-set}, using optimistic forgetting. For SPB we can learn an optimal policy from as few as 25 demonstrations and SVB is able to learn a goal-reaching policy with 125 demonstrations.
  • Figure 4: Unsupervised demonstrations for Safe-RL experiments. Constraint-violating trajectories are shown in red and goal-reaching trajectories in blue.
  • Figure 5: Episodic return using the LMPC procedure outlined in algorithm \ref{alg:safe-set} after initially training offline using the datasets depicted in figure \ref{fig:D-unsupervised_demos}.
  • ...and 11 more figures