Safe reinforcement learning in uncertain contexts

Dominik Baumann; Thomas B. Schön

TL;DR

The paper tackles safe reinforcement learning when a robot’s dynamics are affected by discrete contexts that cannot be measured directly. It contributes (i) frequentist, input-dependent bounds for multi-class classification based on kernel mean embeddings, (ii) a context-identification procedure with statistical guarantees based on the maximum mean discrepancy (MMD), and (iii) an integration of both components into a SafeOpt-based safe-learning loop so that safety guarantees are retained during exploration. The theory is complemented by a Furuta pendulum experiment in which camera images are used to infer the weight attached to the pole; when the images are not sufficiently informative, the algorithm falls back on dedicated identification experiments. In this experiment, the proposed method incurs no failures, at the cost of additional training time compared with plain SafeOpt. The results support data-driven safe learning in settings where the context is uncertain or only partially observable.

Abstract

When deploying machine learning algorithms in the real world, guaranteeing safety is an essential asset. Existing safe learning approaches typically consider continuous variables, i.e., regression tasks. However, in practice, robotic systems are also subject to discrete, external environmental changes, e.g., having to carry objects of certain weights or operating on frozen, wet, or dry surfaces. Such influences can be modeled as discrete context variables. In the existing literature, such contexts are, if considered, mostly assumed to be known. In this work, we drop this assumption and show how we can perform safe learning when we cannot directly measure the context variables. To achieve this, we derive frequentist guarantees for multi-class classification, allowing us to estimate the current context from measurements. Further, we propose an approach for identifying contexts through experiments. We discuss under which conditions we can retain theoretical guarantees and demonstrate the applicability of our algorithm on a Furuta pendulum with camera measurements of different weights that serve as contexts.
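To make the idea of identifying contexts through experiments more concrete, the following is a minimal sketch of a maximum mean discrepancy (MMD) estimate between two sets of samples. It is not the authors' implementation: the Gaussian kernel, the lengthscale, the biased (V-statistic) estimator, and the synthetic Gaussian data standing in for trajectory measurements are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gram matrix of a Gaussian (RBF) kernel between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def mmd_squared(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, lengthscale)
    k_yy = gaussian_kernel(y, y, lengthscale)
    k_xy = gaussian_kernel(x, y, lengthscale)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic stand-ins for measurements (assumption, not data from the paper).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 2))      # known context
same_context = rng.normal(0.0, 1.0, size=(200, 2))   # new data, same context
other_context = rng.normal(1.5, 1.0, size=(200, 2))  # new data, shifted dynamics

mmd_same = mmd_squared(reference, same_context)
mmd_other = mmd_squared(reference, other_context)
print(f"same context: {mmd_same:.4f}, other context: {mmd_other:.4f}")
```

A small estimate suggests that both sample sets come from the same distribution, a large one that they differ. Turning this into a decision with statistical guarantees requires a threshold that depends on the sample size and the desired confidence level, which this sketch does not provide.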

Paper Structure (23 sections, 8 theorems, 22 equations, 9 figures, 1 algorithm)

Introduction
Related work
Problem setting and background
Background
Problem setting
Preliminaries
Context identification
Classification
Safe reinforcement learning in uncertain contexts
Context identification with guarantees
A classifier with frequentist bounds
Safe learning
Evaluation
Safe reinforcement learning in uncertain contexts
Comparison
...and 8 more sections

Key Result

Theorem 1

The classifier \eqref{eqn:class_prob_cme} is consistent if $k(y,\cdot)$ is in the image of $R_{YY}$.
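The referenced classifier equation is not reproduced on this page. As a rough illustration of how class probabilities can be estimated with conditional kernel mean embeddings, the sketch below uses a generic empirical estimator with a Kronecker delta kernel on the labels, which reduces to kernel ridge regression on one-hot labels. The function names, the Gaussian input kernel, the regularization value, and the synthetic clusters are assumptions; the sketch does not include the paper's frequentist uncertainty bounds.

```python
import numpy as np


def rbf_gram(a, b, lengthscale=1.0):
    """Gram matrix of a Gaussian (RBF) kernel between the rows of a and b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def cme_class_probabilities(x_train, labels, x_query, n_classes,
                            reg=1e-3, lengthscale=1.0):
    """Estimate p(class | x) for each query point via an empirical
    conditional mean embedding with a Kronecker delta label kernel.

    The returned values are not guaranteed to lie in [0, 1] or to sum to one.
    """
    n = x_train.shape[0]
    one_hot = np.eye(n_classes)[labels]              # shape (n, n_classes)
    gram = rbf_gram(x_train, x_train, lengthscale)   # shape (n, n)
    # Solve (K + n * reg * I) W = Y instead of forming an explicit inverse.
    weights = np.linalg.solve(gram + n * reg * np.eye(n), one_hot)
    return rbf_gram(x_query, x_train, lengthscale) @ weights


# Three synthetic, well-separated clusters as stand-ins for three contexts.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
x_train = centers[labels] + 0.5 * rng.normal(size=(90, 2))

probs = cme_class_probabilities(x_train, labels, centers, n_classes=3)
predicted = probs.argmax(axis=1)
print(np.round(probs, 2))
print(predicted)
```

Because the raw estimates can fall slightly outside $[0, 1]$, a practical implementation would typically clip and renormalize them before use.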

Figures (9)

Figure 1: Our experimental setup. We aim at optimizing a balancing controller for a Furuta pendulum whose dynamics can be altered by adding (removing) weights to (from) its pole. Our algorithm tries to infer the current weight from image data and resorts to identifying it through dedicated experiments if the image data is not sufficiently informative.
Figure 2: Trajectories of context identification experiments.
Figure 3: MMD of context identification experiments for different weights for varying $a$. For $a>50$, we see that the MMD is at a low level, i.e., trajectories are approximately independent.
Figure 4: Prediction of weights based on camera images. We show the prediction $\hat{p}_\mathrm{c}$ and uncertainty intervals from Corollary \ref{cor:prob_uncertainty} for ten images without weight (left), ten with weight one (middle), and ten with weight two (right). Wrong predictions are marked with red crosses. From top to bottom, we see how the uncertainty intervals decrease from a data set of ten images, over one with around 150 images (middle), to the full data set of 312 images.
Figure 5: Training time and number of failures for pure SafeOpt, SafeOpt with context identification and classification (ours), and SafeOpt with context identification before the start of every experiment (ContID). While our extensions require more samples for the additional context identification and, therefore, more training time, we do not incur any failures, whereas we observe several when using only SafeOpt.
...and 4 more figures

Theorems & Definitions (18)

Theorem 1: hsu2018hyperparameter
Remark 1
Definition 1
Proposition 1
proof
Definition 2
Lemma 1
proof
Lemma 2
proof
...and 8 more
