Progressive Safeguards for Safe and Model-Agnostic Reinforcement Learning

Nabil Omi, Hosein Hasanbeig, Hiteshi Sharma, Sriram K. Rajamani, Siddhartha Sen

TL;DR

This framework is inspired by how parents safeguard their children across a progression of increasingly riskier tasks, imparting a sense of safety that is carried over from task to task, and gives rise to an end-to-end safe learning approach with wide applicability.

Abstract

In this paper we propose a formal, model-agnostic meta-learning framework for safe reinforcement learning. Our framework is inspired by how parents safeguard their children across a progression of increasingly riskier tasks, imparting a sense of safety that is carried over from task to task. We model this as a meta-learning process where each task is synchronized with a safeguard that monitors safety and provides a reward signal to the agent. The safeguard is implemented as a finite-state machine based on a safety specification; the reward signal is formally shaped around this specification. The safety specification and its corresponding safeguard can be arbitrarily complex and non-Markovian, which adds flexibility to the training process and explainability to the learned policy. The design of the safeguard is manual but it is high-level and model-agnostic, which gives rise to an end-to-end safe learning approach with wide applicability, from pixel-level game control to language model fine-tuning. Starting from a given set of safety specifications (tasks), we train a model such that it can adapt to new specifications using only a small number of training samples. This is made possible by our method for efficiently transferring safety bias between tasks, which effectively minimizes the number of safety violations. We evaluate our framework in a Minecraft-inspired Gridworld, a VizDoom game environment, and an LLM fine-tuning application. Agents trained with our approach achieve near-minimal safety violations, while baselines are shown to underperform.

Paper Structure

This paper contains 13 sections, 2 theorems, 19 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Theorem 6

In an MDP $\mathfrak{M}$ with a bounded reward function and a finite action space, optimal policies are stationary and deterministic.

Figures (13)

  • Figure 1: Transfer of safety bias in parent-child interplay.
  • Figure 2: A simplified depiction of safeguarded learning.
  • Figure 3: A stochastically-labelled Minecraft environment over which various safety specifications can be defined. The transparency level of each object corresponds to the probability of that object being observed in that location.
  • Figure 4: An example of progressive safeguards. The green states are accepting states, i.e., the set $\mathcal{F}$ in Definition 8. An edge with label $\texttt{true}$ reads any label from the power set $2^{\mathcal{L}}$, and an edge with label $\texttt{else}$ reads any label from $2^{\mathcal{L}}$ except those that are outgoing from its node. Note that by reading labels that are unsafe with respect to the specification, the safeguard moves to a rejecting sink component (Definition 9). As per Basic Safeguards and also Safeguard 1, interaction with $\texttt{lava}$ or $\texttt{creeper}$ is unsafe. However, Safeguard 2 allows the agent to interact with $\texttt{lava}$ after it has collected $\texttt{wood}$ and gone to $\texttt{workbench}$ (to create a bridge, for instance). Similarly, Safeguard 3 prescribes that if the agent collects $\texttt{wood}$ and $\texttt{iron}$ and goes to $\texttt{smithtable}$ (to create a sword, for instance), then dealing with $\texttt{creeper}$ is safe.
  • Figure 5: Minecraft experiment over 10 runs. (a) Convergence of the expected return. (b) Cumulative number of safety violations using only Safeguard 3 (zero-shot case) compared to our approach (PSL); plots for intrinsic fear and the RL baseline are omitted for a clearer comparison, as they incur orders of magnitude more violations.
  • ...and 8 more figures
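
The Safeguard 2 behaviour described in the Figure 4 caption can be illustrated with a small finite-state monitor. This is only a sketch under stated assumptions, not the authors' implementation: the state names (`START`, `HAS_WOOD`, `HAS_BRIDGE`, `REJECT`), the class `Safeguard2`, the helper `run`, and the reward values `0.0` and `-1.0` are hypothetical, since the caption specifies which labels are unsafe but not how states are named or how the reward is shaped.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical state names; the Figure 4 caption does not name the states.
START, HAS_WOOD, HAS_BRIDGE, REJECT = "start", "has_wood", "has_bridge", "reject"


class Safeguard2:
    """Toy monitor: lava is unsafe until wood, then workbench, have been read."""

    def __init__(self) -> None:
        self.state = START

    def step(self, labels: FrozenSet[str]) -> float:
        """Read one label set, update the state, return an assumed reward."""
        if self.state == REJECT:
            return -1.0  # rejecting sink: once entered, it is never left
        lava_unsafe = "lava" in labels and self.state != HAS_BRIDGE
        if "creeper" in labels or lava_unsafe:
            self.state = REJECT
            return -1.0  # assumed penalty for a safety violation
        if self.state == START and "wood" in labels:
            self.state = HAS_WOOD
        elif self.state == HAS_WOOD and "workbench" in labels:
            self.state = HAS_BRIDGE
        return 0.0  # assumed neutral reward for a safe step

    @property
    def violated(self) -> bool:
        return self.state == REJECT


def run(trace: Iterable[Iterable[str]]) -> Tuple[bool, List[float]]:
    """Feed a sequence of label sets to a fresh safeguard."""
    guard = Safeguard2()
    rewards = [guard.step(frozenset(labels)) for labels in trace]
    return guard.violated, rewards
```

Because the monitor keeps its own state, the same `lava` label is judged differently depending on the history of labels read so far, which is what makes the specification non-Markovian with respect to the environment state alone. For example, `run([["wood"], ["workbench"], ["lava"]])` stays safe, whereas `run([["lava"]])` enters the rejecting sink immediately.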

Theorems & Definitions (14)

  • Definition 1: Markov Decision Process (MDP)
  • Definition 3: MDP Stationary Policy
  • Definition 4: Expected Discounted Return
  • Definition 5: Optimal Policy
  • Theorem 6: Puterman
  • Definition 7: Path
  • Definition 8: Safeguard
  • Definition 9: Rejecting Sink Component
  • Definition 10: Fictitious Safeguarded MDP
  • Example 11: Minecraft Gridworld Environment
  • ...and 4 more