Reinforcement Learning with Ensemble Model Predictive Safety Certification

Sven Gronauer, Tom Haider, Felippe Schmoeller da Roza, Klaus Diepold

TL;DR

The paper tackles the challenge of safe exploration in deep reinforcement learning for safety-critical tasks by introducing Ensemble Model Predictive Safety Certification (X-MPSC). X-MPSC combines an ensemble of probabilistic dynamics models with tube-based model predictive control to certify and potentially modify the learner's actions, ensuring all safety constraints are respected over a planning horizon. The method demonstrates substantially fewer constraint violations than competitive baselines, and the use of a coarse prior dynamics model can reduce violations by an order of magnitude without harming performance. The approach relies on offline data from a safe backup controller to bootstrap training and leverages ellipsoidal uncertainty and recursive feasibility to maintain safety during learning, offering a practical path toward safe real-world deployment in robotics and related domains.

Abstract

Reinforcement learning algorithms need exploration to learn. However, unsupervised exploration prevents the deployment of such algorithms on safety-critical tasks and limits real-world deployment. In this paper, we propose a new algorithm called Ensemble Model Predictive Safety Certification that combines model-based deep reinforcement learning with tube-based model predictive control to correct the actions taken by a learning agent, keeping safety constraint violations at a minimum through planning. Our approach aims to reduce the amount of prior knowledge about the actual system by requiring only offline data generated by a safe controller. Our results show that we can achieve significantly fewer constraint violations than comparable reinforcement learning methods.
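
The certification idea can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: instead of tube-based MPC with ellipsoidal uncertainty sets, it assumes a small ensemble of linear models `(A, B)`, a hypothetical linear backup controller `u = -K x`, box constraints `|x| <= x_max`, and a finite grid of candidate actions. The learner's action is replaced by the closest candidate whose rollout stays inside the constraints for every ensemble member. All names and numbers below are illustrative assumptions.

```python
import numpy as np

def rollout_is_safe(x0, v0, model, K, x_max, horizon):
    """Apply v0 first, then the backup controller; check box constraints at each step."""
    A, B = model
    x = np.asarray(x0, dtype=float)
    u = np.atleast_1d(v0).astype(float)
    for _ in range(horizon):
        x = A @ x + B @ u
        if np.any(np.abs(x) > x_max):
            return False
        u = -K @ x  # assumed safe backup controller takes over after the first step
    return True

def certify_action(x0, u_learner, models, K, x_max, horizon, candidates):
    """Return the candidate closest to u_learner that every ensemble member predicts to be safe."""
    u_learner = np.atleast_1d(u_learner).astype(float)
    ordered = sorted(
        candidates,
        key=lambda v: float(np.linalg.norm(np.atleast_1d(v) - u_learner)),
    )
    for v in ordered:
        if all(rollout_is_safe(x0, v, m, K, x_max, horizon) for m in models):
            return np.atleast_1d(v).astype(float)
    # No candidate certified: fall back to the backup controller's action.
    return -K @ np.asarray(x0, dtype=float)

# Illustrative setup: double integrator (position, velocity) with dt = 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
# Ensemble members differ only in input gain, a stand-in for model disagreement.
models = [(A, scale * B) for scale in (0.9, 1.0, 1.1)]
K = np.array([[1.0, 1.5]])           # hypothetical backup gain
x_max = np.array([1.0, 1.0])         # box constraint on position and velocity
candidates = np.linspace(-5.0, 5.0, 21)

# Near the position limit, an aggressive learner action gets corrected.
v_near_limit = certify_action([0.9, 0.0], 5.0, models, K, x_max, 20, candidates)
# Far from the limit, a mild learner action passes through unchanged.
v_center = certify_action([0.0, 0.0], 1.0, models, K, x_max, 20, candidates)
print(v_near_limit, v_center)
```

Requiring agreement across all ensemble members is what makes the filter conservative: an action that one model predicts to be safe is still rejected if another member predicts a violation within the horizon.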

Paper Structure (45 sections, 25 equations, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: X-MPSC uses multi-step planning with ellipsoidal uncertainty estimates. (Left) Ellipsoidal uncertainty propagation with a single model. (Right) Tube-based predictions generated by an ensemble of models. By utilizing multiple models, an unsafe action $u_t$ (red) is corrected to $v_0$ (blue), which keeps the system within the safety constraints over the horizon $N$.
  • Figure 2: Experimental results. Thick lines show the average over five independent seeds and the shaded area denotes the standard deviation. (Top) The cumulative reward of one episode reported over the total environment steps. (Bottom) Total constraint violations over the whole training.
  • Figure 3: Impact of X-MPSC hyper-parameters on safety and performance in Simple Pendulum.