Stabilizing reinforcement learning control: A modular framework for optimizing over all stable behavior

Nathan P. Lawrence; Philip D. Loewen; Shuyuan Wang; Michael G. Forbes; R. Bhushan Gopaluni

TL;DR

This work addresses the challenge of ensuring stability in reinforcement learning-based control by embedding the Youla–Kučera parameterization into a data-driven framework. The authors construct a data-driven internal model from Hankel matrices of input-output data, using a dynamic form of Willems' fundamental lemma, and define a stable operator $Q$ that determines the controller through $K(z)=Q(z)/(1-P(z)Q(z))$, where $P$ is the plant. This construction admits both linear and nonlinear realizations of $Q$ as well as fixed-structure tuning. They establish stability criteria for Hankel models under noise, provide probabilistic bounds for random Hankel matrices, and develop Lyapunov-based methods to train a stable $Q$ with modular RL integration. Simulation studies on an industrial tank and on fixed-structure controller tuning compare the stabilizing framework with stability-agnostic RL, illustrating its practical potential for safe, data-driven control in process systems. Overall, the modular approach decouples algorithms, function approximators, and dynamic models, offering a scalable path to stable, data-driven RL across linear, nonlinear, and MIMO settings.
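The relation $K(z)=Q(z)/(1-P(z)Q(z))$ can be illustrated with a small numerical check. The sketch below is not taken from the paper: it assumes a scalar, stable plant $P$, a stable parameter $Q$, and a negative unity-feedback loop, and the particular coefficients are made up for illustration. Under those assumptions the closed-loop characteristic polynomial reduces to $d_P^2 d_Q$, so the closed-loop poles are exactly the poles of $P$ and $Q$.

```python
import numpy as np

def youla_controller(nP, dP, nQ, dQ):
    """Return numerator/denominator of K = Q / (1 - P Q) as polynomials in z."""
    nK = np.polymul(nQ, dP)
    dK = np.polysub(np.polymul(dP, dQ), np.polymul(nP, nQ))
    return nK, dK

def closed_loop_poles(nP, dP, nK, dK):
    """Poles of the negative unity-feedback loop: roots of dP*dK + nP*nK."""
    char_poly = np.polyadd(np.polymul(dP, dK), np.polymul(nP, nK))
    return np.roots(char_poly)

# Hypothetical stable plant P(z) = 0.5 / (z - 0.9)
nP, dP = np.array([0.5]), np.array([1.0, -0.9])
# Hypothetical stable parameter Q(z) = 0.4 / (z - 0.5)
nQ, dQ = np.array([0.4]), np.array([1.0, -0.5])

nK, dK = youla_controller(nP, dP, nQ, dQ)
poles = closed_loop_poles(nP, dP, nK, dK)
spectral_radius = float(np.max(np.abs(poles)))
print("closed-loop poles:", poles)
print("stable:", spectral_radius < 1.0)
```

Any other stable choice of $Q$ leaves the loop stable in this setting, which is why an RL agent can search over $Q$ freely instead of over $K$ directly.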

Abstract

We propose a framework for the design of feedback controllers that combines the optimization-driven and model-free advantages of deep reinforcement learning with the stability guarantees provided by using the Youla–Kučera parameterization to define the search domain. Recent advances in behavioral systems allow us to construct a data-driven internal model; this enables an alternative realization of the Youla–Kučera parameterization based entirely on input-output exploration data. Perhaps of independent interest, we formulate and analyze the stability of such data-driven models in the presence of noise. The Youla–Kučera approach requires a stable "parameter" for controller design. For the training of reinforcement learning agents, the set of all stable linear operators is given explicitly through a matrix factorization approach. Moreover, a nonlinear extension is given using a neural network to express a parameterized set of stable operators, which enables seamless integration with standard deep learning libraries. Finally, we show how these ideas can also be applied to tune fixed-structure controllers.
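To make the idea of an explicit, unconstrained parameterization of stable linear operators concrete, here is a minimal sketch. It assumes a factorization of the form $A = S^{-1} O C S$ with $S$ positive definite, $O$ orthogonal, and $C$ positive semidefinite with $\|C\|_2 < 1$; whether this matches the paper's exact construction is an assumption, and the function name and `margin` parameter are hypothetical. Since $A$ is similar to $OC$ and $\|OC\|_2 = \|C\|_2 < 1$, every such $A$ has spectral radius below one, whatever the free parameters are.

```python
import numpy as np

def stable_matrix(theta_S, theta_O, theta_C, margin=1e-3):
    """Map three unconstrained square matrices to a Schur-stable matrix A."""
    n = theta_S.shape[0]
    S = theta_S @ theta_S.T + np.eye(n)      # symmetric positive definite
    O, _ = np.linalg.qr(theta_O)             # orthogonal factor from a QR step
    M = theta_C @ theta_C.T                  # positive semidefinite
    C = (1.0 - margin) * M / (1.0 + np.linalg.norm(M, 2))  # ||C||_2 < 1
    return np.linalg.solve(S, O @ C @ S)     # A = S^{-1} O C S

rng = np.random.default_rng(1)
radii = []
for _ in range(100):
    thetas = [rng.standard_normal((4, 4)) for _ in range(3)]
    A = stable_matrix(*thetas)
    radii.append(float(np.max(np.abs(np.linalg.eigvals(A)))))

print("largest spectral radius over 100 random draws:", max(radii))
```

Because the map from the free parameters to $A$ needs no projection or constraint handling, a gradient-based learner can update the parameters directly while staying inside the stable set.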

Paper Structure (22 sections, 11 theorems, 54 equations, 5 figures, 2 algorithms)

Introduction
Contributions
Related work
Notation
Background
A dynamic Willems' lemma as an internal model
Data-driven realization of the Youla-Kučera parameterization
On the stability of noisy Hankel matrices
Data-driven stability test
Random Hankel matrices
Hankel models with additive noise
Stabilizing reinforcement learning control
Learning stable operators
Unconstrained reinforcement learning over stable operators
Simulation studies
...and 7 more sections

Key Result

Theorem 2.6

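Let $\{ u_{t}, y_{t} \}_{t = 0}^{N-1}$ be a trajectory of LTI system $(A, B, C)$ where $u$ is persistently exciting of order $L+n$. Then $\{ \overline{u}_{t}, \overline{y}_{t} \}_{t = 0}^{L-1}$ is a trajectory of $(A, B, C)$ if and only if there exists $\alpha \in \mathbb{R}^{N-L+1}$ such that $$\begin{bmatrix} H_L(u) \\ H_L(y) \end{bmatrix} \alpha = \begin{bmatrix} \overline{u} \\ \overline{y} \end{bmatrix},$$ where $H_L(u)$ and $H_L(y)$ denote the depth-$L$ Hankel matrices of the input and output data. Here the right-hand side is the block-structured column vector formed from $\overline{u} = \left[u_0 \ld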
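The statement of Theorem 2.6 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's code: the system matrices, data length, and helper names (`hankel`, `simulate`) are hypothetical, the system is single-input single-output, and the data are noise-free. It stacks depth-$L$ Hankel matrices of the recorded input and output, then tests whether a candidate length-$L$ sequence lies in their column space.

```python
import numpy as np

def hankel(w, L):
    """Depth-L Hankel matrix of a scalar signal: column j is w[j:j+L]."""
    return np.column_stack([w[j:j + L] for j in range(len(w) - L + 1)])

def simulate(A, B, C, u, x0):
    """Outputs y_t = C x_t with x_{t+1} = A x_t + B u_t."""
    x, y = x0.copy(), []
    for ut in u:
        y.append(float(C @ x))
        x = A @ x + B * ut
    return np.array(y)

# Hypothetical controllable and observable second-order system (n = 2)
A = np.array([[0.9, 0.2], [0.0, 0.5]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])

rng = np.random.default_rng(0)
N, L = 60, 6
u = rng.standard_normal(N)                 # random probing input
y = simulate(A, B, C, u, np.zeros(2))
H = np.vstack([hankel(u, L), hankel(y, L)])  # shape (2L, N-L+1)

# A genuine length-L trajectory from a different initial state and input
u_bar = rng.standard_normal(L)
y_bar = simulate(A, B, C, u_bar, rng.standard_normal(2))
w_ok = np.concatenate([u_bar, y_bar])

# A sequence that is not a trajectory: corrupt the last output sample
w_bad = w_ok.copy()
w_bad[-1] += 1.0

def residual(w):
    # rcond truncates numerically zero singular values of the rank-deficient H
    alpha, *_ = np.linalg.lstsq(H, w, rcond=1e-10)
    return float(np.linalg.norm(H @ alpha - w))

residual_ok, residual_bad = residual(w_ok), residual(w_bad)
print("trajectory residual:", residual_ok)
print("non-trajectory residual:", residual_bad)
```

A residual near zero means some $\alpha$ reproduces the candidate sequence, so it is consistent with the system; the corrupted sequence leaves a clearly nonzero residual.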

Figures (5)

Figure 1: Cumulative reward over two PI-tuning experiments: one using the proposed stabilizing framework and the other using standard RL. The stability-agnostic agent often destabilizes the system and struggles to recover.
Figure 2: $100$ time steps of input-output data are collected using a standard normal probing signal. The recursion in \ref{eq:alpha_dynamics} is used to continue the rollout. This is done several times for different samples of output noise. The bottom panel shows the evolution of the spectral radii for the noisy and noise-free matrices $H^{+} H'$.
Figure 3: Cumulative reward curve over $20$ training sessions. The solid line is the median and the shaded region shows the interquartile range. The dashed line and its shaded region are the final results of training without the stability constraint.
Figure 4: A sample input-output rollout by the trained RL agent for one of the training sessions. Dashed lines are setpoints; solid lines are measured values.
Figure 5: Heatmap of projected PI parameters strictly inside the stability boundary.

Theorems & Definitions (24)

Definition 2.4
Definition 2.5
Theorem 2.6: See vanwaarde2020WillemsFundamental
Corollary 2.7
Remark 2.8
proof
Remark 2.9
Theorem 2.10
proof
Lemma 3.1: Hanson–Wright inequality, adapted from rudelson2013HansonWrightInequality
...and 14 more
