Synthesis of Model Predictive Control and Reinforcement Learning: Survey and Classification

Rudolf Reiter, Jasper Hoffmann, Dirk Reinhardt, Florian Messerer, Katrin Baumgärtner, Shamburaj Sawant, Joschka Boedecker, Moritz Diehl, Sebastien Gros

TL;DR

This survey addresses the long-standing divide between Model Predictive Control and Reinforcement Learning by articulating their fundamental similarities and orthogonal strengths. It introduces a taxonomy for combining MPC and RL through an actor-critic lens, covering expert-MPC, MPC-in-the-deployed-policy, and MPC-as-critic paradigms, along with architectures for parameterized MPC. The authors synthesize theoretical connections, notably MPC–MDP equivalence results, and survey a broad literature spanning aligned learning, closed-loop learning, and pre-/postprocessing, while highlighting practical software tools. The work aims to guide researchers and practitioners in selecting architectures and methods that leverage MPC’s guarantees with RL’s data-driven adaptability, and it underscores remaining challenges around differentiability, computation, and safety in real-time deployments.

Abstract

The fields of MPC and RL consider two successful control techniques for Markov decision processes. Both approaches are derived from similar fundamental principles, and both are widely used in practical applications, including robotics, process control, energy systems, and autonomous driving. Despite their similarities, MPC and RL follow distinct paradigms that emerged from diverse communities and different requirements. Various technical discrepancies, particularly the role of an environment model as part of the algorithm, lead to methodologies with nearly complementary advantages. Due to their orthogonal benefits, research interest in combination methods has recently increased significantly, leading to a large and growing set of complex ideas leveraging MPC and RL. This work illuminates the differences, similarities, and fundamentals that allow for different combination algorithms and categorizes existing work accordingly. Particularly, we focus on the versatile actor-critic RL approach as a basis for our categorization and examine how the online optimization approach of MPC can be used to improve the overall closed-loop performance of a policy.

Paper Structure

This paper contains 73 sections, 1 theorem, 40 equations, 8 figures, 9 tables.

Key Result

Theorem 10.1

Suppose that Assumption assume:S holds for $\bar{N} \geq N$. Then, there exists a terminal cost $\bar{V}_\theta^{\mathrm{MPC}}$ and a stage cost $l_\theta^{\mathrm{MPC}}$ such that the following identities hold, for all $\gamma$, $N\in\mathbb{N}$ and $s\in\mathcal{S}$:

Figures (8)

  • Figure 1: Paper structure. The main sections are highlighted in gray, subchapters in white, and tables in green.
  • Figure 2: Modular view on combinations of MPC and RL. The combinations are aligned with the sections of this survey. The horizontal separation corresponds to using MPC as part of the expert actor, the deployed policy, and the RL critic. This overview highlights the possibility of using several instances of MPC with different roles in various parts of an RL algorithm and a deployed policy. MPC is used with fixed expert parameters within the expert actor, as a reference generator, as a postprocessing filter, or, possibly, in the critic. Learned MPC, i.e., an MPC structure involving learned parameters, can be used within the learned actor or possibly within the learned critic.
  • Figure 3: Parameterized MPC architectures: Proposed architectures of actors (or potentially critics) used in RL utilizing MPC and NN/simple parameters. In the integrated architecture, the NN is part of the optimization layer and depends on the decision variables. If an NN can be evaluated separately from the optimization routine but provides the parameters to an MPC optimization layer, the architecture is referred to as hierarchical. If the MPC problem is solved in parallel to the NN, we refer to a parallel architecture. If the learned parameters do not depend on the current state $s$, the architecture is referred to as parameterized.
  • Figure 4: Combinations: MPC as an expert actor. The plot is split into the learning and deployment phases. Blue boxes indicate Neural Networks (NNs), and green boxes are used for MPCs.
  • Figure 5: Combinations: MPC within the deployed policy. The plot is split into the learning and deployment phases. Blue boxes indicate Neural Networks (NNs), green boxes are used for MPCs, and blue/green boxes refer to parameterized MPCs that involve parameters/NNs that are learned during the learning phase, see Sect. \ref{sec:architectures}.
  • ...and 3 more figures
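
As a rough illustration of the four architectures named in the Figure 3 caption, the sketch below uses a scalar linear-quadratic toy problem. Everything in it is an assumption made for illustration and is not taken from the survey: the dynamics constants `A` and `B`, all function names, the softplus/tanh expressions standing in for a neural network, and the grid search standing in for a real optimization solver.

```python
import numpy as np

A, B = 1.0, 0.5                          # hypothetical dynamics: s_next = A*s + B*u
CANDIDATES = np.linspace(-2.0, 2.0, 401)  # hypothetical control grid for the toy solver


def softplus(x):
    """Smooth positive map, used here as a stand-in for an NN output layer."""
    return np.log1p(np.exp(x))


def mpc_first_control(s, q, r, p, horizon=5):
    """Unconstrained scalar LQ MPC via a backward Riccati recursion.

    q, r, p are the stage-state, stage-control, and terminal weights.
    Returns the first control of the optimal plan (requires horizon >= 1).
    """
    gain = 0.0
    for _ in range(horizon):
        gain = (A * B * p) / (r + B * B * p)   # feedback gain of the current stage
        p = q + A * A * p - A * B * p * gain   # cost-to-go one stage earlier
    return -gain * s


def parameterized_policy(s, theta):
    """Parameterized: learned constants theta do not depend on the state s."""
    q, r, p = theta
    return mpc_first_control(s, q, r, p)


def hierarchical_policy(s, w):
    """Hierarchical: an NN-like map is evaluated first and supplies MPC parameters."""
    q = softplus(w[0] * abs(s) + w[1])
    return mpc_first_control(s, q, r=1.0, p=q)


def parallel_policy(s, theta, w):
    """Parallel: the MPC output and an NN-like correction are computed side by side."""
    return parameterized_policy(s, theta) + w[0] * np.tanh(w[1] * s)


def integrated_policy(s, w):
    """Integrated: the NN-like terminal cost depends on the decision variable u,
    so it must be evaluated inside the optimization (here a one-step grid search)."""
    s_next = A * s + B * CANDIDATES
    terminal = softplus(w[0] * s_next**2 + w[1])
    cost = s**2 + CANDIDATES**2 + terminal
    return CANDIDATES[np.argmin(cost)]
```

In this toy setting, the difference between the variants is only where the learned quantity enters: as constants, as state-dependent solver inputs, as an additive term outside the solver, or inside the objective that the solver minimizes.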

Theorems & Definitions (4)

  • Remark 2.1
  • Remark 3.1
  • Theorem 10.1: kordabad_equivalence_2024
  • Proof 10.1