Learning safety in model-based Reinforcement Learning using MPC and Gaussian Processes

Filippo Airaldi, Bart De Schutter, Azita Dabiri

TL;DR

Addresses safety in model-based RL with MPC under unknown dynamics. Proposes GP-based estimation of unknown safety constraints $z(\theta)$ on the MPC parameters $\theta$, and a data-driven safe set $S_D$ that holds with probability $\beta$. Implements an algorithm that runs MPC with the current parameters, updates the GP constraints from observed trajectories, and performs constrained RL updates, backtracking on $\beta$ to ensure feasibility. Demonstrates on a quadrotor with wind disturbances that GP-guided safety reduces unsafe episodes by about 57–64% and yields faster convergence while staying near baseline performance, offering a practical data-driven safety mechanism for learning-based MPC.

Abstract

We propose a method to encourage safety in Model Predictive Control (MPC)-based Reinforcement Learning (RL) via Gaussian Process (GP) regression. This framework consists of 1) a parametric MPC scheme that is employed as a model-based controller with approximate knowledge of the real system's dynamics, 2) an episodic RL algorithm tasked with adjusting the MPC parametrization in order to increase its performance, and lastly, 3) GP regressors used to estimate, directly from data, constraints on the MPC parameters capable of predicting, up to some probability, whether the parametrization is likely to yield a safe or unsafe policy. These constraints are then enforced on the RL updates in an effort to enhance the learning method with a probabilistic safety mechanism. Compared to other recent publications combining safe RL with MPC, our method does not require further assumptions on, e.g., the prediction model in order to retain computational tractability. We illustrate the results of our method in a numerical example on the control of a quadrotor drone in a safety-critical environment.
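
The probabilistic constraint on the MPC parameters can be illustrated with a small sketch. It is not the authors' implementation: it assumes a scalar parameter $\theta$, a zero-mean GP with a squared-exponential kernel and hand-picked hyperparameters, and a hypothetical toy constraint $z(\theta) = \theta - 1$ standing in for observed constraint violations. Under a Gaussian posterior with mean $\mu(\theta)$ and standard deviation $\sigma(\theta)$, requiring $P[z(\theta) \le 0] \ge \beta$ is equivalent to $\mu(\theta) + \Phi^{-1}(\beta)\,\sigma(\theta) \le 0$. The function `constrained_update` replaces the constrained RL update with a coarse line search along the proposed step, and lowers $\beta$ by an assumed factor whenever no candidate is feasible.

```python
import math
from statistics import NormalDist


def rbf(a, b, length=0.5, variance=1.0):
    """Squared-exponential kernel between two scalar parameters."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(thetas, zs, query, noise=1e-4):
    """Posterior mean and standard deviation of z at `query` (zero prior mean)."""
    n = len(thetas)
    K = [[rbf(thetas[i], thetas[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, zs))
    k_star = [rbf(t, query) for t in thetas]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(query, query) - sum(vi * vi for vi in v), 0.0)
    return mean, math.sqrt(var)


def is_probably_safe(thetas, zs, query, beta):
    """True if P[z(query) <= 0] >= beta under the GP posterior."""
    mean, std = gp_posterior(thetas, zs, query)
    return mean + NormalDist().inv_cdf(beta) * std <= 0.0


def constrained_update(theta, step, thetas, zs,
                       beta=0.95, beta_min=0.5, shrink=0.9):
    """Line search along `step`; backtrack on beta if nothing is feasible.

    Simplification (assumption): a grid of step fractions stands in for the
    constrained optimization problem of the actual RL update.
    """
    fractions = [1.0, 0.75, 0.5, 0.25, 0.0]
    while beta >= beta_min:
        for t in fractions:
            candidate = theta + t * step
            if is_probably_safe(thetas, zs, candidate, beta):
                return candidate, beta
        beta *= shrink  # relax the required safety probability and retry
    return theta, beta  # nothing feasible: keep the current parameter


# Hypothetical data: observed constraint values z = theta - 1 per episode.
thetas = [0.0, 0.5, 1.0, 1.5]
zs = [t - 1.0 for t in thetas]
new_theta, beta_used = constrained_update(0.0, 1.5, thetas, zs)
print(new_theta, beta_used)
```

In this toy setting the full step to $\theta = 1.5$ is rejected because the posterior mean of $z$ there is positive, so the update is shortened to a parameter that the GP deems safe at the requested $\beta$. Lowering $\beta$ only happens when even the current parameter fails the check, which mirrors the role of backtracking as a feasibility fallback.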

Paper Structure

This paper contains 12 sections, 20 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Comparison in performance between the original LSTD Q-learning algorithm and its GP-based safe variant (with and without prior knowledge). In dashed black, the baseline cumulative cost when exact knowledge of the system is given to the MPC controller.
  • Figure 2: Comparison in (top) violations of the altitude constraint, where positive values imply violation, and (bottom) the cumulative number of unsafe episodes.
  • Figure 3: Backtracked safety probability $\beta$ during learning.