End-to-End Reinforcement Learning of Koopman Models for Economic Nonlinear Model Predictive Control

Daniel Mayfrank; Alexander Mitsos; Manuel Dahmen

TL;DR

The paper tackles the challenge of achieving accurate yet fast control for economic nonlinear model predictive control (eNMPC) by learning task-optimized Koopman surrogate models in an end-to-end reinforcement learning (RL) framework. It represents nonlinear dynamics in a lifted linear space with $z_0= \psi_{\theta}(x_0)$, $z_{t+1}= A_{\theta} z_t + B_{\theta} u_t$, and $\hat{x}_t= C_{\theta} z_t$, which yields convex OCPs solvable in real time and differentiable for RL updates via cvxpylayers. Through two CSTR-based case studies, the authors show that end-to-end learned Koopman models outperform system-identification-trained models and that the resulting eNMPC controllers can adapt to control-setting changes without retraining, unlike model-free RL policies. The work highlights the practical potential of combining Koopman embeddings, differentiable MPC, and policy-optimization techniques to produce robust, computation-efficient economic controllers for nonlinear processes, and points to future work on scaling to larger systems and integrating with model-based RL components.
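The lifted linear prediction structure named above can be illustrated with a minimal sketch. It uses plain Python lists, and the encoder `psi_toy` and the matrices `A_toy`, `B_toy`, `C_toy` are hypothetical placeholders chosen for illustration; in the paper the encoder and matrices are learned, and none of these values come from it.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def koopman_rollout(
    psi: Callable[[Vector], Vector],
    A: Matrix,
    B: Matrix,
    C: Matrix,
    x0: Vector,
    controls: Sequence[Vector],
) -> List[Vector]:
    """Predict states with z_0 = psi(x_0), z_{t+1} = A z_t + B u_t, x_hat_t = C z_t."""
    z = psi(x0)                      # lift the initial state once (nonlinear step)
    predictions = [matvec(C, z)]     # decode x_hat_0
    for u in controls:
        Az = matvec(A, z)
        Bu = matvec(B, u)
        z = [a + b for a, b in zip(Az, Bu)]   # linear update in the lifted space
        predictions.append(matvec(C, z))      # decode x_hat_{t+1}
    return predictions


# Hypothetical toy instance (assumption): 1 state, 2 lifted coordinates, 1 input.
def psi_toy(x: Vector) -> Vector:
    return [x[0], x[0] ** 2]


A_toy = [[0.5, 0.0], [0.0, 0.25]]
B_toy = [[1.0], [0.0]]
C_toy = [[1.0, 0.0]]

x_hat = koopman_rollout(psi_toy, A_toy, B_toy, C_toy, x0=[2.0], controls=[[1.0], [0.0]])
print(x_hat)
```

Only the initial lifting is nonlinear; every later step is linear in the lifted state and the inputs. That is why an optimal control problem built on such a model can be convex, provided its objective and constraints are convex as well.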

Abstract

(Economic) nonlinear model predictive control ((e)NMPC) requires dynamic models that are sufficiently accurate and computationally tractable. Data-driven surrogate models for mechanistic models can reduce the computational burden of (e)NMPC; however, such models are typically trained by system identification for maximum prediction accuracy on simulation samples and perform suboptimally in (e)NMPC. We present a method for end-to-end reinforcement learning of Koopman surrogate models for optimal performance as part of (e)NMPC. We apply our method to two applications derived from an established nonlinear continuous stirred-tank reactor model. The controller performance is compared to that of (e)NMPCs utilizing models trained using system identification, and model-free neural network controllers trained using reinforcement learning. We show that the end-to-end trained models outperform those trained using system identification in (e)NMPC, and that, in contrast to the neural network controllers, the (e)NMPC controllers can react to changes in the control setting without retraining.

Paper Structure (14 sections, 12 equations, 7 figures, 6 tables)

Introduction
Method
Koopman theory for control
Deep reinforcement learning with continuous action spaces
Post-optimal sensitivity analysis of convex problems
End-to-end learning of Koopman models for MPC
Numerical experiments
Case study description
Data sampling and system identification
NMPC
eNMPC
Analysis of Koopman embedding before and after end-to-end training
eNMPC with adapted bounds
Conclusion

Figures (7)

Figure 1: Workflow from mechanistic model to task-optimal dynamic Koopman surrogate model.
Figure 2: Method for end-to-end refinement of the dynamic Koopman surrogate model. The RL agent consists of a stochastic actor and a critic. The actor is an MPC policy that uses a dynamic Koopman surrogate model. The critic is a feedforward neural network. The environment consists of the mechanistic model of the system to be controlled and a reward function that depends on the controller's task.
Figure 3: Summary of training runs in NMPC. Each training configuration is run ten times with different random seeds. We average the running score over the last 30 episodes. The lines represent individual training runs. Additionally, we highlight the highest average running score achieved by a policy type (star for Koopman-RL, dot for MLP).
Figure 4: NMPC: comparison of the controller behavior, given the same randomly generated production rate trajectory. Best viewed in color.
Figure 5: Summary of RL training runs in eNMPC. Each training configuration is run ten times with different random seeds. We average the running score over the previous 30 episodes. The lines represent individual training runs. The constant dashed green line represents the median maximum performance when training the MLP policies for 20,000 episodes. Additionally, we highlight the highest average running score achieved by a policy type (star for Koopman-RL, dot for MLP). Best viewed in color.
...and 2 more figures
