Sample-Efficient Reinforcement Learning of Koopman eNMPC
Daniel Mayfrank, Mehmet Velioglu, Alexander Mitsos, Manuel Dahmen
TL;DR
The paper addresses sample-efficient RL-based training of data-driven (economic) nonlinear MPCs ((e)NMPCs) by integrating Model-Based Policy Optimization (MBPO) with differentiable Koopman (e)NMPCs. It introduces a physics-informed surrogate environment built from a PINN ensemble. Because the MPC layer is differentiable end to end, PPO can jointly optimize the Koopman-model parameters and task-specific state-bound parameters. In a CSTR-based demand-response case study, the method outperforms two benchmarks, data-driven eNMPCs based on system identification (SI) and neural-network controllers trained with MBPO, achieving higher rewards and substantially better sample efficiency; physics-informed learning adds further stability and efficiency gains. The approach targets real-world settings where environment interactions are costly, and it suggests extensions to larger, more complex systems, to coupling with disturbance estimation, and to more advanced MBPO variants. Overall, the work offers a practical pathway for deploying RL-enhanced predictive controllers in industrial settings where high-quality mechanistic models are unavailable or expensive to obtain.
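To make the MBPO idea above concrete, the following is a minimal sketch of how an ensemble of learned models can generate short synthetic rollouts that branch from real states. It is an illustration under stated assumptions, not the authors' implementation: the scalar toy system `true_step`, the bootstrap-fitted linear ensemble members (standing in for the PINN ensemble), the placeholder `policy` (standing in for the Koopman eNMPC), and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(x, u):
    """Toy scalar system standing in for the real environment (hypothetical)."""
    return 0.9 * x + 0.5 * u

# 1) Collect a small batch of real transitions (x, u, x_next).
X = rng.uniform(-1.0, 1.0, size=200)
U = rng.uniform(-1.0, 1.0, size=200)
X_next = true_step(X, U) + 0.01 * rng.standard_normal(200)

def fit_member(X, U, X_next, rng):
    """Fit one linear model x_next ~ a*x + b*u on a bootstrap resample."""
    idx = rng.integers(0, len(X), size=len(X))
    features = np.column_stack([X[idx], U[idx]])
    coef, *_ = np.linalg.lstsq(features, X_next[idx], rcond=None)
    return coef  # array [a, b]

# 2) Train an ensemble of surrogate models (linear here; PINNs in the paper).
ensemble = [fit_member(X, U, X_next, rng) for _ in range(5)]

def policy(x):
    """Placeholder policy; in the paper this role is played by the eNMPC."""
    return np.clip(-0.5 * x, -1.0, 1.0)

def surrogate_rollouts(start_states, horizon, rng):
    """Branch short rollouts from real states, using one randomly
    chosen ensemble member per step."""
    transitions = []
    x = np.array(start_states, dtype=float)
    for _ in range(horizon):
        u = policy(x)
        a, b = ensemble[rng.integers(len(ensemble))]
        x_next = a * x + b * u
        transitions.append((x.copy(), u, x_next.copy()))
        x = x_next
    return transitions

# 3) Synthetic data that a policy-gradient method such as PPO could train on.
synthetic = surrogate_rollouts(X[:32], horizon=3, rng=rng)
```

Keeping the rollouts short limits how far model errors can compound, which is the usual motivation for branching from real states instead of simulating full episodes in the surrogate.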
Abstract
Reinforcement learning (RL) can be used to tune data-driven (economic) nonlinear model predictive controllers ((e)NMPCs) for optimal performance in a specific control task by optimizing the dynamic model or parameters in the policy's objective function or constraints, such as state bounds. Because the sample efficiency of RL is crucial in this setting, we combine a model-based RL algorithm with our previously published method that turns Koopman (e)NMPCs into automatically differentiable policies. We apply our approach to an eNMPC case study of a continuous stirred-tank reactor (CSTR) model from the literature. The approach outperforms two benchmark methods, namely data-driven eNMPCs that use models based on system identification without further RL tuning of the resulting policy, and neural network controllers trained with model-based RL, achieving superior control performance and higher sample efficiency. Utilizing partial prior knowledge about the system dynamics via physics-informed learning further increases sample efficiency.
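For readers unfamiliar with Koopman models, the sketch below illustrates the general structure that makes them attractive for MPC: the state is lifted to a higher-dimensional space in which the dynamics are approximated as linear, z_{k+1} = A z_k + B u_k, and the state is recovered as x = C z. This is a hedged illustration only. It uses a fixed polynomial dictionary and a least-squares fit (an EDMD-style system identification), whereas the paper tunes the model with RL; the toy system `step`, the dictionary in `lift`, and all numeric values are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u):
    """Toy nonlinear scalar system (hypothetical, not the CSTR model)."""
    return 0.9 * x - 0.1 * x**3 + 0.5 * u

def lift(x):
    """Fixed polynomial dictionary z = [x, x^2, x^3] (assumed for illustration)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**2, x**3], axis=-1)

# Sample transitions from the toy system.
X = rng.uniform(-1.0, 1.0, 500)
U = rng.uniform(-1.0, 1.0, 500)
X_next = step(X, U)

# Least-squares fit of z_next ~ A z + B u in the lifted space.
Z, Z_next = lift(X), lift(X_next)
features = np.hstack([Z, U[:, None]])             # shape (500, 4)
theta, *_ = np.linalg.lstsq(features, Z_next, rcond=None)  # shape (4, 3)
A = theta[:3].T                                   # shape (3, 3)
B = theta[3:].T                                   # shape (3, 1)
C = np.array([[1.0, 0.0, 0.0]])                   # decoder: x is the first lifted coordinate

def predict(x0, controls):
    """Roll the linear lifted model forward and decode the state at each step."""
    z = lift(x0)
    trajectory = []
    for u in controls:
        z = A @ z + B[:, 0] * u
        trajectory.append((C @ z)[0])
    return np.array(trajectory)

one_step = predict(0.3, [0.2])
multi_step = predict(0.3, [0.2, -0.1, 0.0])
```

Because the prediction model is linear in the lifted coordinates, the resulting optimal control problem can stay convex for suitable objectives and constraints. In this toy case the one-step prediction of x is exact, since the true dynamics lie in the span of the dictionary, while multi-step predictions are only approximate because the lifted space is not closed under the dynamics.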
