Sample-Efficient Reinforcement Learning of Koopman eNMPC

Daniel Mayfrank, Mehmet Velioglu, Alexander Mitsos, Manuel Dahmen

TL;DR

The paper tackles the challenge of sample-efficient RL-based training of data-driven (economic) nonlinear MPCs ((e)NMPCs) by integrating Model-Based Policy Optimization (MBPO) with differentiable Koopman (e)NMPCs. It introduces a physics-informed surrogate environment built from a PINN ensemble and jointly optimizes Koopman-model parameters and task-specific state-bound parameters via PPO, enabling end-to-end differentiability through the MPC layer. Applied to a CSTR-based demand-response case, the method outperforms data-driven SI-based eNMPCs and neural-network controllers trained with MBPO, achieving higher rewards and substantially improved sample efficiency, with physics-informed learning offering additional stability and efficiency gains. The approach addresses real-world constraints where environment interactions are costly, and it points to scalable extensions to larger, more complex systems and tight coupling with disturbance estimation or advanced MBPO variants. Overall, the work provides a practical pathway for deploying RL-enhanced predictive controllers in industrial settings where high-quality mechanistic models are unavailable or expensive to obtain.

Abstract

Reinforcement learning (RL) can be used to tune data-driven (economic) nonlinear model predictive controllers ((e)NMPCs) for optimal performance in a specific control task by optimizing the dynamic model or parameters in the policy's objective function or constraints, such as state bounds. However, the sample efficiency of RL is crucial, and to improve it, we combine a model-based RL algorithm with our published method that turns Koopman (e)NMPCs into automatically differentiable policies. We apply our approach to an eNMPC case study of a continuous stirred-tank reactor (CSTR) model from the literature. The approach outperforms benchmark methods, i.e., data-driven eNMPCs using models based on system identification without further RL tuning of the resulting policy, and neural network controllers trained with model-based RL, by achieving superior control performance and higher sample efficiency. Furthermore, utilizing partial prior knowledge about the system dynamics via physics-informed learning further increases sample efficiency.
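
As a rough sketch of what "optimizing the dynamic model or parameters in the policy's objective function or constraints" refers to, the policy can be written as a parametric optimal control problem. The symbols $\bm{\theta}$, $\bm{x}_t$, $\bm{u}^{*}_{t}$, $f$, $\bm{g}$, and $\bm{h_\theta}$ follow the Figure 1 caption. The horizon length $N$, the predicted states $\hat{\bm{x}}_k$, the stage-wise sum, and the split into $\bm{\theta}_{\text{K}}$ (model) and $\bm{\theta}_{\text{B}}$ (bounds, as named in the Figure 3 caption) inside one problem are assumptions made here for illustration, not the paper's exact formulation.

$$
\begin{aligned}
\bm{u}^{*}_{t} \in \arg\min_{\bm{u}_{t},\dots,\bm{u}_{t+N-1}} \quad & \sum_{k=t}^{t+N-1} f(\hat{\bm{x}}_{k}, \bm{u}_{k}) \\
\text{s.t.} \quad & \hat{\bm{x}}_{k+1} = \bm{h}_{\bm{\theta}_{\text{K}}}(\hat{\bm{x}}_{k}, \bm{u}_{k}), \quad k = t,\dots,t+N-1, \\
& \bm{g}_{\bm{\theta}_{\text{B}}}(\hat{\bm{x}}_{k}, \bm{u}_{k}) \le \bm{0}, \quad k = t,\dots,t+N-1, \\
& \hat{\bm{x}}_{t} = \bm{x}_{t}.
\end{aligned}
$$

Read this way, RL tuning treats the map $\bm{x}_t \mapsto \bm{u}^{*}_{t}$ as a policy whose learnable parameters sit inside the optimization problem rather than in a neural network's weights.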

Paper Structure

This paper contains 16 sections, 11 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: (a) Procedure for RL-based training of an (e)NMPC. (b) A differentiable eNMPC policy parameterized by the parameters $\bm{\theta}$ of the dynamic model; takes as input the current state $\bm{x}_t$ and computes the optimal control action $\bm{u}^{*}_{t}$ based on the minimization of a cost function $f$, subject to inequality constraints $\bm{g}$, and the learnable dynamic model $\bm{h_\theta}$. (c) Differentiable eNMPC policy with parameterized inequality constraints $\bm{g}$, e.g., state bounds. Training this policy leaves the underlying dynamic model $\bm{h}$ unchanged but adapts the inequality constraints to counteract model-plant mismatch.
  • Figure 2: Dyna-style (Sutton, 1991) model-based RL framework. The three steps are repeated for a predefined number of steps, or until satisfactory control performance is reached.
  • Figure 3: Using MBPO to train a task-optimal Koopman (e)NMPC controller. (a) The training algorithm. The following three steps are executed in a loop until a stopping criterion is reached: First, the Koopman (e)NMPC interacts with the environment to gather data about the dynamics. Second, all data collected up to the current step is used to fit the Koopman model (parameters $\bm{\theta}_{\text{K}}$) and the PINN ensemble (parameters $\bm{\omega}_i \forall i \in \{1,2,\dots,n\}$). Third, a surrogate RL environment is constructed using the NN ensemble and the Koopman (e)NMPC is optimized by tuning the parameters $\bm{\theta}_{\text{B}}$, i.e., the state bounds. (b) The automatically differentiable Koopman (e)NMPC whose behavior is defined by the parameters of the Koopman model ($\bm{\theta}_{\text{K}}$) and the parameters modifying the state bounds ($\bm{\theta}_{\text{B}}$). The parameters are color-coded to match the colors of the corresponding optimization steps in Fig. 3(a).
  • Figure 4: Critic architecture. As for the policy, the system states are scaled so that the feasible range is in [-1,1], whereas the product storage and the electricity prices are left unscaled.
  • Figure 5: General schematic of the PINN models used in the CSTR case study.
  • ...and 4 more figures
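
The three-step loop described in the Figure 2 and Figure 3 captions (interact with the real environment, refit the models on all data so far, improve the policy inside a surrogate environment) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and none of it comes from the paper: the scalar linear `plant_step` stands in for the real system, a clipped feedback law with one `gain` stands in for the Koopman (e)NMPC, bootstrapped least-squares fits stand in for the PINN ensemble, and a grid search stands in for PPO.

```python
import random

def plant_step(x, u):
    """Hypothetical 'real' environment; its coefficients are unknown to the learner."""
    return 0.9 * x + 0.5 * u

def policy(x, gain):
    """Stand-in for the Koopman (e)NMPC: clipped feedback with one tunable parameter."""
    return max(-1.0, min(1.0, -gain * x))

def collect(gain, n_steps, rng):
    """Step 1: run the current policy on the real environment and log transitions."""
    x, data = rng.uniform(-1.0, 1.0), []
    for _ in range(n_steps):
        u = policy(x, gain) + rng.gauss(0.0, 0.1)  # small exploration noise
        x_next = plant_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_linear(samples, ridge=1e-9):
    """Least-squares fit of x_next ~ a*x + b*u (2x2 normal equations, tiny ridge term)."""
    sxx = sum(x * x for x, _, _ in samples) + ridge
    suu = sum(u * u for _, u, _ in samples) + ridge
    sxu = sum(x * u for x, u, _ in samples)
    sxy = sum(x * y for x, _, y in samples)
    suy = sum(u * y for _, u, y in samples)
    det = sxx * suu - sxu * sxu
    return ((sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det)

def fit_ensemble(data, n_models, rng):
    """Step 2: fit an ensemble on bootstrap resamples (stand-in for the PINN ensemble)."""
    return [fit_linear([rng.choice(data) for _ in data]) for _ in range(n_models)]

def rollout_cost(step, gain, horizon=20, starts=(-1.0, 0.5, 1.0)):
    """Accumulated quadratic cost of the policy under a given transition function."""
    total = 0.0
    for x in starts:
        for _ in range(horizon):
            u = policy(x, gain)
            total += x * x + 0.1 * u * u
            x = step(x, u)
    return total

def tune_policy(ensemble, candidates):
    """Step 3: improve the policy on the surrogate only (grid search stands in for PPO)."""
    def surrogate_cost(gain):
        return sum(rollout_cost(lambda x, u, a=a, b=b: a * x + b * u, gain)
                   for a, b in ensemble) / len(ensemble)
    return min(candidates, key=surrogate_cost)

def train(n_iterations=3, seed=0):
    """Repeat the three steps for a fixed number of iterations."""
    rng = random.Random(seed)
    candidates = [0.25 * i for i in range(9)]  # gains 0.0, 0.25, ..., 2.0
    gain, data, ensemble = 0.0, [], []
    for _ in range(n_iterations):
        data += collect(gain, n_steps=20, rng=rng)           # 1) real interaction
        ensemble = fit_ensemble(data, n_models=5, rng=rng)   # 2) model fitting
        gain = tune_policy(ensemble, candidates)             # 3) policy optimization
    return gain, ensemble
```

Calling `train()` returns the tuned gain and the fitted ensemble. The point of the structure is that the expensive real environment is touched only in `collect`, while all policy search in `tune_policy` runs on the learned surrogate, which is where the sample-efficiency gain of a Dyna-style scheme comes from.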