MPC4RL -- A Software Package for Reinforcement Learning based on Model Predictive Control

Dirk Reinhardt, Katrin Baumgärtner, Jonathan Frey, Moritz Diehl, Sebastien Gros

TL;DR

The paper addresses the lack of open-source tools for reinforcement-learning-based MPC and introduces MPC4RL, an open-source Python package that links acados with Gymnasium and stable-baselines3 to support learning-enhanced MPC. It extends acados to provide parametric NLP sensitivities, which allow efficient computation of the gradients $\nabla_\theta V_\theta(s)$ and $\nabla_\theta Q_\theta(s,a)$ required by RL methods that use MPC as a function approximator. The authors report that policy-gradient evaluations based on these sensitivities are about an order of magnitude faster than general-purpose approaches, as shown in two case studies. The package is modular and extensible and is released on GitHub; planned extensions include parallel sensitivity evaluation, warm starting, and support for more RL algorithms.
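To make the role of $\nabla_\theta V_\theta(s)$ concrete, the sketch below uses a toy equality-constrained QP in place of an MPC problem. It relies on the standard sensitivity result that, at a regular primal-dual solution, the gradient of the optimal value with respect to the parameters equals the partial derivative of the Lagrangian with respect to those parameters. Everything here is an assumption made for illustration: the problem, the function names `solve_qp`, `value_gradient`, and `finite_difference_gradient`, and the use of NumPy. None of it is the MPC4RL or acados API.

```python
import numpy as np


def solve_qp(theta, s):
    """Toy parametric QP (hypothetical stand-in for an MPC problem):

        V_theta(s) = min_x 0.5*theta[0]*x0^2 + 0.5*x1^2 + theta[1]*x1
                     s.t.  x0 + x1 = s

    Returns the primal solution x, the multiplier lam, and the optimal value.
    Assumes theta[0] > 0 so that the problem is strictly convex.
    """
    H = np.diag([theta[0], 1.0])
    g = np.array([0.0, theta[1]])
    A = np.array([[1.0, 1.0]])
    # KKT system: H x + A^T lam = -g,  A x = s
    kkt = np.block([[H, A.T], [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-g, [s]])
    sol = np.linalg.solve(kkt, rhs)
    x, lam = sol[:2], sol[2:]
    value = 0.5 * x @ H @ x + g @ x
    return x, lam, value


def value_gradient(theta, s):
    """Gradient of V_theta(s) w.r.t. theta via the Lagrangian.

    L = 0.5*theta[0]*x0^2 + 0.5*x1^2 + theta[1]*x1 + lam*(x0 + x1 - s).
    The constraint does not depend on theta here, so only the cost
    contributes: dL/dtheta = [0.5*x0^2, x1], evaluated at the solution.
    """
    x, _, _ = solve_qp(theta, s)
    return np.array([0.5 * x[0] ** 2, x[1]])


def finite_difference_gradient(theta, s, eps=1e-6):
    """Central finite differences on the optimal value, for comparison only."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        v_plus = solve_qp(theta + step, s)[2]
        v_minus = solve_qp(theta - step, s)[2]
        grad[i] = (v_plus - v_minus) / (2.0 * eps)
    return grad


if __name__ == "__main__":
    theta = np.array([2.0, 0.5])
    s = 1.0
    print("Lagrangian-based gradient:", value_gradient(theta, s))
    print("Finite-difference gradient:", finite_difference_gradient(theta, s))
```

The Lagrangian-based gradient needs a single solve, whereas the finite-difference check needs two solves per parameter. That difference is the kind of saving that solver-provided sensitivities aim for, although this toy comparison says nothing about the timings reported in the paper.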

Abstract

In this paper, we present an early software package that integrates Reinforcement Learning (RL) with Model Predictive Control (MPC). Our aim is to make recent theoretical contributions from the literature more accessible to both the RL and MPC communities. We combine standard software tools developed by the RL community, such as Gymnasium, stable-baselines3, or CleanRL, with the acados toolbox, a widely used software package for efficient MPC algorithms. Our core contribution is MPC4RL, an open-source Python package that supports learning-enhanced MPC schemes for existing acados implementations. The package is designed to be modular, extensible, and user-friendly, facilitating the tuning of MPC algorithms for a broad range of control problems. It is available on GitHub.

Paper Structure

This paper contains 23 sections, 19 equations, 4 figures.

Figures (4)

  • Figure 1: Communication between the main components of the MPC4RL package. The AcadosOcpSolver passes the primal-dual solution to the NLP class, which replicates the NLP structure and computes the sensitivities of the NLP solution; this NLP object is not needed when the sensitivities are computed with acados. The RL algorithm requests samples from the replay buffer and updates the parameters of the AcadosOcpSolver (policy). The environment sends the state-action-cost transitions to the replay buffer. Colors indicate the external code bases used to implement each component: CasADi/acados (red), stable-baselines3 (green), and Gymnasium (blue).
  • Figure 2: Total time needed to compute the policy gradient for the chain-mass system with varying number of masses.
  • Figure 3: Results before and after training. State-constraint violations and corresponding large control actions are avoided after training by backing off the constraint (and reference) at the origin.
  • Figure 4: Evolution of the parameters and the accumulated cost for each episode under the MPC-based policy $\pi^\mathrm{MPC}$.