Safe Exploration Using Bayesian World Models and Log-Barrier Optimization

Yarden As, Bhavya Sukhija, Andreas Krause

TL;DR

This paper tackles safe online reinforcement learning in constrained Markov decision processes (CMDPs) under partial observability. The proposed method, CERL, builds an uncertainty-aware world model from a recurrent state space model combined with probabilistic ensembles, and enforces safety by optimizing within a pessimistic safe set using a log-barrier objective optimized with stochastic gradient descent. Evaluated on Safety-Gym with image observations, CERL learns safely and is more sample-efficient than the baselines while maintaining competitive final performance. The approach offers a practical path toward safe, data-efficient, vision-based RL in real-world environments, connecting theoretical guarantees with empirical performance.
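
For orientation, the constrained problem and a log-barrier relaxation of it can be written in a generic textbook form. This is a sketch under assumed notation, not necessarily the exact objective used in the paper: the symbols J, J_c, d, \eta, and the ensemble index i are introduced here for illustration only.

```latex
% CMDP: maximize expected return subject to a cost budget d (assumed notation)
\max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} r_t\Big]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} c_t\Big] \le d

% Pessimistic cost: worst case over the N ensemble members of the world model
\bar{J}_c(\pi) = \max_{i \in \{1, \dots, N\}} J_c^{(i)}(\pi)

% Log-barrier surrogate with barrier weight \eta > 0, defined only where \bar{J}_c(\pi) < d
\max_{\pi} \; J(\pi) + \eta \, \log\big(d - \bar{J}_c(\pi)\big)
```

Because the logarithm diverges to negative infinity as the pessimistic cost approaches the budget d, iterates that start strictly inside the pessimistic safe set are pushed away from its boundary.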

Abstract

A major challenge in deploying reinforcement learning in online tasks is ensuring that safety is maintained throughout the learning process. In this work, we propose CERL, a new method for solving constrained Markov decision processes while keeping the policy safe during learning. Our method leverages Bayesian world models and suggests policies that are pessimistic w.r.t. the model's epistemic uncertainty. This makes CERL robust towards model inaccuracies and leads to safe exploration during learning. In our experiments, we demonstrate that CERL outperforms the current state-of-the-art in terms of safety and optimality in solving CMDPs from image observations.
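
The idea of combining pessimism over an ensemble with a log-barrier can be illustrated on a one-dimensional toy problem. The following is a minimal sketch, not the CERL implementation: the scalar policy parameter `theta`, the quadratic reward, the linear cost models with hypothetical `slopes`, and the backtracking rule are all assumptions made for illustration.

```python
import math


def pessimistic_cost(theta, slopes):
    """Largest cost predicted by any ensemble member (toy linear cost models)."""
    return max(a * theta for a in slopes)


def barrier_objective(theta, slopes, budget, eta):
    """Toy reward plus log-barrier on the pessimistic constraint slack."""
    slack = budget - pessimistic_cost(theta, slopes)
    if slack <= 0.0:
        return -math.inf  # outside the pessimistic safe set
    reward = -(theta - 2.0) ** 2  # unconstrained optimum at theta = 2
    return reward + eta * math.log(slack)


def barrier_gradient(theta, slopes, budget, eta):
    """Analytic gradient of the barrier objective at a strictly feasible theta."""
    worst = max(slopes, key=lambda a: a * theta)  # slope of the worst-case member
    slack = budget - worst * theta
    return -2.0 * (theta - 2.0) - eta * worst / slack


def safe_ascent(theta0, slopes, budget, eta=0.1, lr=0.1, steps=200):
    """Gradient ascent with backtracking so every iterate stays feasible."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        grad = barrier_gradient(theta, slopes, budget, eta)
        current = barrier_objective(theta, slopes, budget, eta)
        step = lr
        # Halve the step while the candidate is infeasible (-inf) or worse.
        while step > 1e-12 and barrier_objective(
            theta + step * grad, slopes, budget, eta
        ) < current:
            step *= 0.5
        if step <= 1e-12:
            break  # no improving feasible step found
        theta += step * grad
        history.append(theta)
    return theta, history


if __name__ == "__main__":
    slopes = [0.9, 1.0, 1.1]  # hypothetical ensemble of cost models
    theta, history = safe_ascent(0.0, slopes, budget=1.5)
    print(f"final theta = {theta:.4f}")
    print(f"pessimistic cost = {pessimistic_cost(theta, slopes):.4f}")
```

In this toy setting the unconstrained optimum (`theta = 2`) violates the budget under every ensemble member, so the barrier keeps the iterate strictly inside the pessimistic safe set `theta < 1.5 / 1.1`. Shrinking `eta` moves the solution closer to the boundary.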

Paper Structure

This paper contains 18 sections, 5 equations, and 2 figures.

Figures (2)

  • Figure 1: We average the accumulated costs for each training run. Error bars represent the standard deviation across all training runs. As shown, CERL outperforms the baseline algorithms with respect to the accumulated costs during training.
  • Figure 2: Learning curves for the objective and constraint for CERL and the baseline algorithms.