Safe Reinforcement Learning Using Robust MPC

Mario Zanon, Sébastien Gros

TL;DR

This paper tackles safety in reinforcement learning by integrating robust model predictive control (MPC) as a function-approximation mechanism within RL. The authors introduce Safe RL-MPC, which uses a tube-based robust MPC with a parametrizable uncertainty set and a Safe Design Constraint to guarantee constraint satisfaction during both learning and deployment. Key contributions include a data-efficient approach to manage large data streams via a nominal linear model and convex-hull data compression, a differentiable MPC scheme for gradient-based RL updates, and algorithms for safe exploration and recursive feasibility. The framework is demonstrated on a linear system and a nonlinear evaporation process, showing how RL can adapt the uncertainty set and the MPC parameters to improve performance while preserving safety. The work provides a foundation for extending safe RL with scenario trees, stochastic gradients, and hybrid cost formulations for broader real-world applications.

Abstract

Reinforcement Learning (RL) has recently impressed the world with stunning results in various applications. While the potential of RL is now well established, many critical aspects still need to be tackled, including safety and stability. These issues, while partially neglected by the RL community, are central to the control community, which has investigated them extensively. Model Predictive Control (MPC) is one of the most successful control techniques, in part because of its ability to provide such guarantees even for uncertain constrained systems. Since MPC is an optimization-based technique, optimality has also often been claimed. Unfortunately, the performance of MPC is highly dependent on the accuracy of the model used for predictions. In this paper, we propose to combine RL and MPC in order to exploit the advantages of both and thereby obtain a controller that is optimal and safe. We illustrate the results with a numerical example in simulation.

Paper Structure

This paper contains 26 sections, 8 theorems, 67 equations, 4 figures, 1 algorithm.

Key Result

Proposition 1

Assume that $\mathcal{X}_\mathrm{f}:=\{ \, \boldsymbol{\mathrm{x}} \, | \, G\boldsymbol{\mathrm{x}} + \boldsymbol{\mathrm{g}} \leq 0 \, \}$ is RPI and Problem \eqref{eq:robust_mpc} is feasible at time $k=0$. Then Problem \eqref{eq:robust_mpc} is feasible for all $\boldsymbol{\mathrm{w}}_k \in \boldsymbol{\mathrm{W}}\dots$
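The RPI (robust positively invariant) assumption in Proposition 1 can be checked numerically for a polytopic set. The following is a minimal sketch, not the authors' implementation. It assumes closed-loop dynamics $\boldsymbol{\mathrm{x}}^+ = A_\mathrm{cl}\boldsymbol{\mathrm{x}} + \boldsymbol{\mathrm{w}}$ with the disturbance confined to a polytope given by its vertices. The names `is_rpi`, `A_cl`, and `W_vertices`, as well as the numerical values, are hypothetical and chosen for illustration only. For each facet of $\mathcal{X}_\mathrm{f}$, a linear program finds the worst-case successor state and verifies that it still satisfies $G\boldsymbol{\mathrm{x}} + \boldsymbol{\mathrm{g}} \leq 0$.

```python
import numpy as np
from scipy.optimize import linprog


def is_rpi(G, g, A_cl, W_vertices, tol=1e-8):
    """Check whether X_f = {x | G x + g <= 0} is RPI for x+ = A_cl x + w.

    Assumption: w lies in the convex hull of the rows of W_vertices,
    and X_f is bounded so that every LP below has a finite optimum.
    """
    for G_i, g_i in zip(G, g):
        # Maximize G_i A_cl x over X_f by minimizing its negative.
        res = linprog(-(G_i @ A_cl), A_ub=G, b_ub=-g, bounds=(None, None))
        if not res.success:
            return False  # empty or unbounded set: invariance cannot be certified
        worst_state = -res.fun
        # A linear function over a polytope attains its maximum at a vertex.
        worst_noise = np.max(W_vertices @ G_i)
        if worst_state + worst_noise + g_i > tol:
            return False
    return True


# Hypothetical example: unit box |x_i| <= 1 and contractive dynamics.
G = np.vstack([np.eye(2), -np.eye(2)])
g = -np.ones(4)
A_cl = 0.5 * np.eye(2)

# Disturbance boxes |w_i| <= 0.4 and |w_i| <= 0.6, listed by their vertices.
corners = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
W_small = 0.4 * corners
W_large = 0.6 * corners

small_ok = is_rpi(G, g, A_cl, W_small)
large_ok = is_rpi(G, g, A_cl, W_large)
```

In this toy setting the worst-case successor along each facet is $0.5 + 0.4 - 1 < 0$ for the smaller disturbance set and $0.5 + 0.6 - 1 > 0$ for the larger one, so only the smaller set leaves the box invariant.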

Figures (4)

  • Figure 1: Schematic of the proposed setup: data are used to construct the SDC based on $\mathcal{\bar{W}}$ and to evaluate the cost in \eqref{eq:rl_problem_sampled}. This cost depends on $Q_{\boldsymbol{\mathrm{\theta}}},V_{\boldsymbol{\mathrm{\theta}}}$ obtained from MPC, and $\ell$. MPC controls the system. The signal that toggles between exploitation and exploration, sent from RL to switch between MPC \eqref{eq:robust_mpc} and \eqref{eq:exploration}, is omitted for clarity.
  • Figure 2: Snapshots at $k=0$ and $k=24$. Top figure (state space): state constraint set (light blue), state-input constraints using feedback matrix $K$ (green), RPI set (red), terminal constraint set $\mathcal{X}_\mathrm{f}$ (cyan for $k=24$, transparent for $k=0$). Predicted trajectory at $k=24$: initial state (red dot), predicted trajectory (solid black line), reference $\boldsymbol{\mathrm{s}}^\mathrm{r}$ (black circle), uncertainty tube (yellow). Bottom figure (noise space): true uncertainty set (transparent octagon), noise samples (black dots), vertices of their convex hull (red dots), uncertainty set approximations $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$ (cyan sets, with $k=0$ in the background). A better approximation $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$ ($k=24$) enlarges $\mathcal{X}_\mathrm{f}$.
  • Figure 3: Snapshots at $k=34$ and $k=109$: same convention as Figure 2, with predicted trajectory at $k=109$. Both the RPI and terminal sets moved closer to the setpoint thanks to a better approximation $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$ for the specific control task (see the bottom plot). Moreover, next to $\boldsymbol{\mathrm{s}}^\mathrm{r}$ the constraints are tightened more at $k=36$ than afterwards.
  • Figure 4: Evaporation process. Top two figures: RPI and terminal set at the beginning and end of the learning process, same color convention as Figure 2. Bottom figure: the uncertainty set approximation $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$.
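The bottom panel of Figure 2 shows noise samples, the vertices of their convex hull, and a polytopic approximation $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$. The sketch below illustrates why storing only the hull vertices is sufficient when checking that a polytope contains all observed samples. It is an illustration under stated assumptions, not the paper's algorithm. It assumes $\boldsymbol{\mathrm{W}}_{\boldsymbol{\mathrm{\theta}}}$ can be written as $\{\boldsymbol{\mathrm{w}} \,|\, M\boldsymbol{\mathrm{w}} \leq \boldsymbol{\mathrm{m}}\}$. The helper names, the matrices `M` and `m`, and the synthetic samples are all hypothetical.

```python
import numpy as np
from scipy.spatial import ConvexHull


def compress_to_hull(samples):
    """Keep only the samples that are vertices of the convex hull."""
    hull = ConvexHull(samples)
    return samples[hull.vertices]


def polytope_contains(M, m, points, tol=1e-9):
    """Return True if every point w satisfies M w <= m."""
    return bool(np.all(points @ M.T <= m + tol))


# Synthetic noise samples (assumption: 2-D noise drawn from a unit box).
rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(500, 2))
vertices = compress_to_hull(samples)

# Hypothetical candidate set: the box |w_i| <= 1 in halfspace form.
M = np.vstack([np.eye(2), -np.eye(2)])
m = np.ones(4)

# Each constraint is linear, so its maximum over the samples is attained
# at a hull vertex; checking the vertices is therefore equivalent to
# checking every sample.
contains_all = polytope_contains(M, m, samples)
contains_vertices = polytope_contains(M, m, vertices)
```

Because only the hull vertices need to be stored, the cost of the containment check depends on the number of vertices rather than on the number of samples collected.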

Theorems & Definitions (22)

  • Definition 1: $\eta$-safe Policy
  • Remark 1
  • Remark 2
  • Proposition 1: Recursive Feasibility
  • Lemma 1
  • Corollary 1
  • Remark 3
  • Lemma 2
  • Remark 4
  • Remark 5
  • ...and 12 more