A modular framework for stabilizing deep reinforcement learning control

Nathan P. Lawrence, Philip D. Loewen, Shuyuan Wang, Michael G. Forbes, R. Bhushan Gopaluni

TL;DR

The paper addresses stability in RL-based control by embedding the Youla-Kučera parameterization, which constrains the search to stable operators, and by replacing the explicit plant model with a data-driven internal model built from input-output data. Stable nonlinear operators are learned through a Lyapunov-guided two-network design, and the Youla-Kučera framework is realized in a model-free manner by using Willems' fundamental lemma to relate data to the closed-loop behavior. Standard RL algorithms can then optimize the objective $J(\pi) = \mathbb{E}_{h \sim p^{\pi}}[\sum_{t=0}^{\infty} \gamma^{t} r(s_t,a_t)]$ over a stable parameter $Q$. The approach is demonstrated on a simulated two-tank system, where learning converges stably and achieves favorable performance. This work offers a practical path to stable, data-driven RL for process control, with clear avenues for extending to stochastic policies and unstable plants.
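As a small illustration of the objective $J(\pi)$, the sketch below estimates it by averaging discounted returns over sampled rollouts. The function names `discounted_return` and `estimate_objective` are hypothetical, and the infinite sum is truncated at the rollout length, which is an assumption of this sketch rather than something stated in the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite rollout."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(rollouts, gamma):
    """Monte Carlo estimate of J: the mean discounted return over rollouts."""
    returns = [discounted_return(rewards, gamma) for rewards in rollouts]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences collected under the same policy.
example_rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
j_hat = estimate_objective(example_rollouts, gamma=0.5)
```

Truncating at a finite horizon biases the estimate by at most $\gamma^{T} r_{\max} / (1 - \gamma)$ for rewards bounded in magnitude by $r_{\max}$, so longer rollouts or smaller $\gamma$ tighten it.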

Abstract

We propose a framework for the design of feedback controllers that combines the optimization-driven and model-free advantages of deep reinforcement learning with the stability guarantees provided by using the Youla-Kučera parameterization to define the search domain. Recent advances in behavioral systems allow us to construct a data-driven internal model; this enables an alternative realization of the Youla-Kučera parameterization based entirely on input-output exploration data. Using a neural network to express a parameterized set of nonlinear stable operators enables seamless integration with standard deep learning libraries. We demonstrate the approach on a realistic simulation of a two-tank system.
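To illustrate why restricting the search to stable operators preserves closed-loop stability, here is a minimal sketch of an internal-model loop for a scalar, stable plant. Everything concrete in it is an assumption made for illustration: the plant coefficients `a` and `b`, the function name `simulate_imc`, the memoryless choices of `q`, and the use of an exact explicit model in place of the data-driven internal model described above. With an exact model, the mismatch signal depends only on the disturbance, so the plant is driven by the stable map `q` rather than by a destabilizing feedback path.

```python
import math


def simulate_imc(q, steps=200, a=0.8, b=0.5, setpoint=1.0, disturbance=0.0):
    """Internal-model loop for the plant y[t+1] = a*y[t] + b*u[t] + disturbance.

    `q` maps the corrected reference to the plant input. The model below is an
    exact copy of the (hypothetical) plant, without the disturbance.
    """
    y = y_model = 0.0
    outputs = []
    for _ in range(steps):
        mismatch = y - y_model            # zero if the model is exact and undisturbed
        u = q(setpoint - mismatch)        # stable parameter acts on corrected reference
        y = a * y + b * u + disturbance   # true plant
        y_model = a * y_model + b * u     # internal model
        outputs.append(y)
    return outputs


# A static gain equal to the inverse plant DC gain gives offset-free tracking.
k = (1 - 0.8) / 0.5
tracking = simulate_imc(lambda e: k * e)
rejecting = simulate_imc(lambda e: k * e, disturbance=0.1)

# A bounded nonlinear q keeps the output bounded by b / (1 - a) = 2.5.
bounded = simulate_imc(math.tanh)
```

The same structure carries over when `q` is a learned nonlinear operator: as long as `q` and the plant are stable and the model is accurate, the loop cannot be destabilized by the choice of `q`.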

Paper Structure

This paper contains 9 sections, 3 theorems, 19 equations, 4 figures, and 1 algorithm.

Key Result

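Theorem 1

Let $\{ u_{t}, y_{t} \}_{t = 0}^{N-1}$ be a trajectory of an LTI system $(A, B, C, D)$ where $u$ is persistently exciting of order $L+n$. Then $\{ \bar{u}_{t}, \bar{y}_{t} \}_{t = 0}^{L-1}$ is a trajectory of $(A, B, C, D)$ if and only if there exists $\alpha \in \mathbb{R}^{N-L+1}$ such that

$$\begin{bmatrix} H_{L}(u) \\ H_{L}(y) \end{bmatrix} \alpha = \begin{bmatrix} \bar{u} \\ \bar{y} \end{bmatrix},$$

where $H_{L}(\cdot)$ denotes the Hankel matrix with $L$ block rows formed from the data, and $\bar{u}$, $\bar{y}$ are the stacked candidate sequences.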
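As a numerical illustration of this theorem, the NumPy sketch below builds depth-$L$ Hankel matrices from one exploration trajectory and uses least squares to check whether a candidate sequence lies in their column span. The plant matrices, the horizon `L`, the data length `N`, and the helper names `hankel` and `simulate` are all hypothetical choices for this example; they are not the two-tank system from the paper.

```python
import numpy as np


def hankel(w, L):
    """Stack every length-L window of the signal w (N samples) as a column."""
    w = np.asarray(w, dtype=float).reshape(len(w), -1)
    N = w.shape[0]
    return np.column_stack([w[j:j + L].ravel() for j in range(N - L + 1)])


def simulate(A, B, C, D, u, x0):
    """Roll out x[t+1] = A x[t] + B u[t], y[t] = C x[t] + D u[t]."""
    x, ys = x0, []
    for ut in u:
        ys.append(C @ x + D @ ut)
        x = A @ x + B @ ut
    return np.array(ys)


# Hypothetical controllable and observable plant with n = 2 states.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
n, L, N = 2, 4, 40

rng = np.random.default_rng(0)
u = rng.standard_normal((N, 1))
y = simulate(A, B, C, D, u, np.zeros(n))

# Persistency of excitation of order L + n: H_{L+n}(u) has full row rank.
is_pe = np.linalg.matrix_rank(hankel(u, L + n)) == L + n

# Data matrix whose column span should contain every length-L trajectory.
H = np.vstack([hankel(u, L), hankel(y, L)])

# A fresh trajectory from a different initial state should be reproduced.
u_bar = rng.standard_normal((L, 1))
y_bar = simulate(A, B, C, D, u_bar, np.array([1.0, -0.5]))
w_bar = np.concatenate([u_bar.ravel(), y_bar.ravel()])
alpha, *_ = np.linalg.lstsq(H, w_bar, rcond=None)
residual = np.linalg.norm(H @ alpha - w_bar)

# Corrupting one output sample should leave a clearly nonzero residual.
w_bad = w_bar.copy()
w_bad[-1] += 1.0
alpha_bad, *_ = np.linalg.lstsq(H, w_bad, rcond=None)
residual_bad = np.linalg.norm(H @ alpha_bad - w_bad)
```

The residual is at the level of rounding error for the genuine trajectory and clearly nonzero for the corrupted one, which is the "if and only if" of the theorem seen numerically.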

Figures (4)

  • Figure 1: A stable nonlinear parameter $Q$ interacts with its environment; collected input-output trajectories are used to construct a Hankel matrix. These ingredients yield an equivalent realization of the Youla-Kučera parameterization.
  • Figure 2: Cumulative reward curve over $20$ training sessions. The solid line is the median and the shaded region shows the interquartile range. The dashed line and its shaded region are the final results of training without the stability constraint.
  • Figure 3: A global view of the training progress across all $20$ sessions. For each episode, a distribution of time spent at various output values is obtained. The heatmap shows the average amount of time spent at each episode-output coordinate.
  • Figure 4: A sample input-output rollout by the trained RL agent for one of the training sessions. Dashed lines are setpoints; solid lines are measured values.

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Theorem 1: see vanwaarde2020WillemsFundamental
  • Corollary 1
  • Theorem 2