Task-optimal data-driven surrogate models for eNMPC via differentiable simulation and optimization

Daniel Mayfrank; Na Young Ahn; Alexander Mitsos; Manuel Dahmen

TL;DR

This work addresses the real-time implementation bottleneck of economic NMPC for nonlinear processes by learning task-optimized Koopman surrogates through differentiable simulation. It combines end-to-end Koopman surrogate learning, differentiable eNMPC policy construction, and SHAC-based policy optimization to exploit analytic gradients. Applied to a CSTR eNMPC benchmark with a horizon of $9$ hours, the approach matches benchmark economics while strictly avoiding constraint violations. The results highlight the stability and constraint-satisfaction benefits of SHAC over PPO-based or non-end-to-end training, suggesting differentiable-environment gradients as a practical tool for large-scale, nonlinear predictive control.

Abstract

Mechanistic dynamic process models may be too computationally expensive to be usable as part of a real-time capable predictive controller. We present a method for end-to-end learning of Koopman surrogate models for optimal performance in a specific control task. In contrast to previous contributions that employ standard reinforcement learning (RL) algorithms, we use a training algorithm that exploits the differentiability of environments based on mechanistic simulation models to aid the policy optimization. We evaluate the performance of our method by comparing it to that of other training algorithms on an existing economic nonlinear model predictive control (eNMPC) case study of a continuous stirred-tank reactor (CSTR) model. Compared to the benchmark methods, our method produces similar economic performance while eliminating constraint violations. Thus, for this case study, our method outperforms the others and offers a promising path toward more performant controllers that employ dynamic surrogate models.
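
To illustrate what "exploiting the differentiability of the environment" can mean in practice, the following minimal sketch differentiates a short rollout return with respect to a policy parameter by propagating sensitivities through the simulation. It is a toy example under stated assumptions, not the authors' implementation: the scalar linear system, the proportional feedback gain `k`, the quadratic reward, and all function names are hypothetical, and the critic that SHAC adds at the end of each short rollout is omitted. The analytic gradient is checked against central finite differences.

```python
def rollout_return_and_grad(k, x0=1.0, a=0.9, b=0.5, rho=0.1, horizon=8):
    """Return J(k) and dJ/dk for a toy scalar system under the policy u = -k * x.

    Hypothetical toy setup: x_{t+1} = a * x_t + b * u_t, r_t = -(x_t^2 + rho * u_t^2).
    """
    x, dx_dk = x0, 0.0            # state and its sensitivity with respect to k
    total, dtotal_dk = 0.0, 0.0   # accumulated return and its gradient
    for _ in range(horizon):
        # Policy and its total derivative with respect to k (chain rule through x).
        u = -k * x
        du_dk = -x - k * dx_dk
        # Stage reward and its derivative.
        r = -(x * x + rho * u * u)
        dr_dk = -(2.0 * x * dx_dk + 2.0 * rho * u * du_dk)
        total += r
        dtotal_dk += dr_dk
        # Differentiable transition: propagate the state and its sensitivity.
        x_next = a * x + b * u
        dx_dk = a * dx_dk + b * du_dk
        x = x_next
    return total, dtotal_dk


def finite_difference_grad(k, eps=1e-6, **kwargs):
    """Central finite-difference estimate of dJ/dk, used only as a check."""
    j_plus, _ = rollout_return_and_grad(k + eps, **kwargs)
    j_minus, _ = rollout_return_and_grad(k - eps, **kwargs)
    return (j_plus - j_minus) / (2.0 * eps)


if __name__ == "__main__":
    ret, grad = rollout_return_and_grad(0.3)
    print("return:", ret)
    print("analytic gradient:", grad)
    print("finite-difference gradient:", finite_difference_grad(0.3))
```

In an actual training setup, an automatic differentiation framework would typically replace the hand-written sensitivity updates, and the parameter being differentiated would be the surrogate model parameters rather than a feedback gain.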

Paper Structure (8 sections, 4 equations, 4 figures, 1 table)

This paper contains 8 sections, 4 equations, 4 figures, 1 table.

Introduction
Method
Numerical experiments
Case study description
Training setup
Results
Policy gradient analysis
Conclusion

Figures (4)

Figure 1: (a) Comparison of two paradigms for the training of data-driven dynamic surrogate models for use in eNMPC. (b) The differentiable eNMPC policy takes as input the current state $\bm{x}_t$ and computes the optimal control action $\bm{u}^{*}_{t}$ based on a cost function $f$, inequality constraints $\bm{g}$, and the learnable discrete-time dynamic surrogate model $\bm{h_\theta}$ (highlighted in blue font), which is parameterized by $\bm{\theta}$.
Figure 2: Workflow from mechanistic model to task-optimal dynamic Koopman surrogate model. Adapted from mayfrank2024end.
Figure 3: Using SHAC to train a task-optimal Koopman surrogate model for the transition function $\bm{\mathcal{F}}$. This figure can be interpreted as a SHAC-specific unrolled version of the typical RL loop shown in the third step of Fig. 2. The policy is optimized by adjusting the parameters $\bm{\theta}$ of the dynamic Koopman surrogate model. $\Phi$ is a convex function that defines the stage cost of the objective. To ensure the feasibility of the resulting optimal control problems, we add slack variables $\bm{s}_t$ to the state bounds (mayfrank2024end). Their use is penalized quadratically using a penalty factor $M$. Due to the use of PyTorch and cvxpylayers (Agrawal2019differentiable), the output $\bm{u}_t$ of the policy is differentiable with respect to $\bm{x}_t$ and $\bm{\theta}$. The critic is a feedforward neural network with trainable parameters $\bm{\phi}$. For clarity, the figure omits the direct dependence of $r_{t+1}$ on $\bm{u}_t$.
Figure 4: Learning progress in the Koopman-SHAC training runs. The dark orange line shows the running average reward over the previous 1024 environment steps, averaged over all ten training runs. The light orange region indicates one standard deviation across the training runs.
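
The captions of Figures 1 and 3 describe the optimal control problem inside the policy only in words. As a rough sketch, and not necessarily the authors' exact formulation, it could be written as below. The horizon length $N$, the running index $k$, the bound symbols $\bm{x}^{\mathrm{lb}}, \bm{x}^{\mathrm{ub}}, \bm{u}^{\mathrm{lb}}, \bm{u}^{\mathrm{ub}}$, the predicted state $\hat{\bm{x}}_k$, and the arguments of $\Phi$ are assumptions introduced for illustration; $\Phi$, $M$, $\bm{s}$, and $\bm{h_\theta}$ are taken from the captions.

```latex
\begin{aligned}
\min_{\bm{u}_{t:t+N-1},\, \bm{s}_{t+1:t+N}} \quad
  & \sum_{k=t}^{t+N-1} \Big( \Phi(\hat{\bm{x}}_{k+1}, \bm{u}_k)
    + M \lVert \bm{s}_{k+1} \rVert_2^2 \Big) \\
\text{s.t.} \quad
  & \hat{\bm{x}}_{k+1} = \bm{h_\theta}(\hat{\bm{x}}_k, \bm{u}_k),
    \quad \hat{\bm{x}}_t = \bm{x}_t, \\
  & \bm{x}^{\mathrm{lb}} - \bm{s}_{k+1} \le \hat{\bm{x}}_{k+1}
    \le \bm{x}^{\mathrm{ub}} + \bm{s}_{k+1},
    \quad \bm{s}_{k+1} \ge \bm{0}, \\
  & \bm{u}^{\mathrm{lb}} \le \bm{u}_k \le \bm{u}^{\mathrm{ub}},
    \qquad k = t, \dots, t+N-1.
\end{aligned}
```

In this reading, the policy output $\bm{u}^{*}_{t}$ is the first control of the minimizer, and the slack variables keep the problem feasible whenever the predicted states would otherwise leave their bounds.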
