Computationally efficient Gauss-Newton reinforcement learning for model predictive control

Dean Brandner, Sebastien Gros, Sergio Lucia

Abstract

Model predictive control (MPC) is widely used in process control due to its interpretability and ability to handle constraints. As a parametric policy in reinforcement learning (RL), MPC offers strong initial performance and low data requirements compared to black-box policies like neural networks. However, most RL methods rely on first-order updates, which scale well to large parameter spaces but converge at most linearly. This makes them inefficient when each policy update requires solving an optimal control problem, as is the case with MPC. While MPC policies typically have few parameters and are thus amenable to second-order approaches, existing second-order methods demand second-order policy derivatives, which can be computationally intractable. This work introduces a Gauss-Newton approximation of the deterministic policy Hessian that eliminates the need for second-order policy derivatives, enabling superlinear convergence with minimal computational overhead. To further improve robustness, we propose a momentum-based Hessian averaging scheme, coupled with an adaptive trust region, for stable training under noisy estimates. We demonstrate the effectiveness of the approach on a nonlinear continuously stirred tank reactor (CSTR), showing faster convergence and improved data efficiency over state-of-the-art first-order methods and deep RL approaches.
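The contrast between linear and superlinear convergence can be illustrated on a toy problem. The sketch below is not the method of this paper: it assumes a hypothetical scalar least-squares objective $f(\theta) = \tfrac{1}{2} r(\theta)^2$ with residual $r(\theta) = e^{\theta} - 2$, and compares plain gradient descent with a generic Gauss-Newton step that replaces the exact Hessian $r'(\theta)^2 + r(\theta)\, r''(\theta)$ by $r'(\theta)^2$, so no second derivative of $r$ is needed. All function names, the step size, and the tolerance are illustrative assumptions.

```python
import math

# Hypothetical scalar problem: minimize f(theta) = 0.5 * r(theta)**2
# with r(theta) = exp(theta) - 2, whose minimizer is theta* = ln(2).

def residual(theta):
    return math.exp(theta) - 2.0

def jacobian(theta):
    # First derivative of the residual; the only derivative used below.
    return math.exp(theta)

def gradient_step(theta, lr=0.1):
    # First-order update: the gradient of f is J * r.
    return theta - lr * jacobian(theta) * residual(theta)

def gauss_newton_step(theta):
    # Gauss-Newton update: approximate the Hessian of f by J * J,
    # dropping the term r * r'' that needs a second derivative.
    j = jacobian(theta)
    return theta - (j * residual(theta)) / (j * j)

def iterations_to_converge(step, theta0=0.0, tol=1e-10, max_iter=1000):
    # Count updates until |theta - theta*| falls below tol.
    theta_star = math.log(2.0)
    theta = theta0
    for k in range(max_iter):
        if abs(theta - theta_star) < tol:
            return k
        theta = step(theta)
    return max_iter

gd_iters = iterations_to_converge(gradient_step)
gn_iters = iterations_to_converge(gauss_newton_step)
print(f"gradient descent: {gd_iters} iterations")
print(f"Gauss-Newton:     {gn_iters} iterations")
```

In this toy setting the dropped term $r\, r''$ vanishes at the solution because the residual is zero there, which is why the approximation does not slow down convergence near the optimum. When every iteration is expensive, as when each update requires solving an optimal control problem, the iteration count is the quantity that matters.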

Paper Structure

This paper contains 20 sections, 6 theorems, 68 equations, 9 figures, 5 tables.

Key Result

Lemma 1

Let the action-value function of the optimal policy $Q^{\pi^\star}(\boldsymbol{s},\boldsymbol{a})$ be differentiable in $\boldsymbol{a}$ at $\boldsymbol{a} = \pi_{\boldsymbol{\theta}^\star}(\boldsymbol{s})$ for all $\boldsymbol{s}\in\mathcal{S}$. Then the gradient evaluated at the optimal action is

Figures (9)

  • Figure 1: Comparison between deep RL (left) and MPC-based RL (right).
  • Figure 2: Hessian and its approximations over $\theta$ for $\gamma = 0.9$ and $\sigma_w^2 = \sigma_0^2 = 0.1$. The Hessian at $\theta^\star$ is highlighted as a star.
  • Figure 3: Error between the current iterate and the optimal parameter $| \theta_k -\theta^\star|$ for the second-order approaches and first-order gradient ascent starting at the initial guess $\theta_0 = 0.6$.
  • Figure 4: CSTR with entering and leaving streams.
  • Figure 5: Expected cumulative reward for increasingly more parameterized MPC schemes. The number in parentheses denotes the total number of trainable parameters.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Corollary 1
  • Theorem 1
  • Corollary 2
  • Theorem 2
  • Corollary 3