Stability-Certified On-Policy Data-Driven LQR via Recursive Learning and Policy Gradient

Lorenzo Sforni; Guido Carnevale; Ivano Notarnicola; Giuseppe Notarstefano

TL;DR

This work addresses infinite-horizon LQR for unknown dynamics by integrating online system identification with policy-gradient optimization in an on-policy setting. The Relearn LQR algorithm simultaneously updates an estimate of the unknown pair $(A_\star,B_\star)$ via a Recursive Least Squares–like mechanism and refines the feedback gain $K$ through gradient-method iterations, while injecting a dithering signal to guarantee persistent excitation. A Lyapunov-based analysis grounded in averaging theory for two-time-scale systems yields stability guarantees for the entire closed-loop learning and control scheme, establishing convergence to the optimal gain $K^\star$ and to the true system matrices for sufficiently small step sizes. Numerical experiments on an aircraft model with static and drifting parameters validate both the convergence and robustness properties, highlighting the practical viability of data-driven control with stability certificates.

Abstract

In this paper, we investigate a data-driven framework to solve Linear Quadratic Regulator (LQR) problems when the dynamics are unknown, with the additional challenge of providing stability certificates for the overall learning and control scheme. Specifically, in the proposed on-policy learning framework, the control input is applied to the actual (unknown) linear system while it is being iteratively optimized. We propose a learning and control procedure, termed Relearn LQR, that combines a recursive least squares method with a direct policy search based on the gradient method. The resulting scheme is analyzed by modeling it as a feedback-interconnected nonlinear dynamical system. A Lyapunov-based approach, exploiting averaging and timescale separation theories for nonlinear systems, allows us to provide formal stability guarantees for the whole interconnected scheme. The effectiveness of the proposed strategy is corroborated by numerical simulations, where Relearn LQR is deployed on an aircraft control problem with both static and drifting parameters.
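The concurrent learning and optimization loop described above can be illustrated with a minimal sketch. It is not the authors' Relearn LQR implementation. In this sketch, a standard recursive least squares update with forgetting factor `lam` stands in for the paper's RLS-like estimator. The gain is then updated by a model-based gradient step on the current estimate, using the classical formula $\nabla_K J = 2[(R + B^\top P_K B)K + B^\top P_K A]\Sigma_K$ for $u = Kx$ and $J(K) = \operatorname{tr}(P_K \Sigma_0)$, and the dithering is i.i.d. Gaussian. The function names, the stacking $\theta = [A\ B]^\top$, the step size `gamma`, the noise levels, and the Schur-check and gradient-clipping safeguards are all hypothetical choices made for this illustration.

```python
import numpy as np


def dlyap(A, Q):
    """Solve X = A X A^T + Q (A must be Schur) by vectorization."""
    n = A.shape[0]
    # Row-major identity: vec(A X A^T) = kron(A, A) vec(X).
    x = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return x.reshape(n, n)


def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))


def lqr_cost_and_grad(A, B, Q, R, K, Sigma0):
    """Model-based cost J(K) = tr(P_K Sigma0) and its gradient for u = K x."""
    AK = A + B @ K
    P = dlyap(AK.T, Q + K.T @ R @ K)   # P = AK^T P AK + Q + K^T R K
    Sigma = dlyap(AK, Sigma0)          # Sigma = AK Sigma AK^T + Sigma0
    grad = 2.0 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
    return np.trace(P @ Sigma0), grad


def lqr_gain_riccati(A, B, Q, R, iters=2000):
    """Reference optimal gain (u = K x) via Riccati value iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
    return K


def relearn_lqr_sketch(A, B, Q, R, K0, T=3000, gamma=5e-3, lam=1.0,
                       dither_std=0.1, noise_std=0.01, max_grad=1.0, seed=0):
    """Hypothetical on-policy loop: RLS on theta = [A B]^T plus gradient steps on K.

    A and B are used only to simulate the 'unknown' plant; the learner sees data.
    """
    rng = np.random.default_rng(seed)
    n, m = B.shape
    Sigma0 = np.eye(n)
    theta = np.zeros((n + m, n))       # estimate of [A B]^T
    Pcov = 100.0 * np.eye(n + m)       # RLS covariance (large = weak prior)
    K = np.array(K0, dtype=float)
    x = rng.standard_normal(n)
    for _ in range(T):
        # On-policy input: current gain plus a dithering term for excitation.
        u = K @ x + dither_std * rng.standard_normal(m)
        x_next = A @ x + B @ u + noise_std * rng.standard_normal(n)
        # Recursive least squares update of theta with forgetting factor lam.
        z = np.concatenate([x, u])
        err = x_next - theta.T @ z
        gain = Pcov @ z / (lam + z @ Pcov @ z)
        theta = theta + np.outer(gain, err)
        Pcov = (Pcov - np.outer(gain, z @ Pcov)) / lam
        # Gradient step on K using the current estimate (certainty equivalence).
        A_hat, B_hat = theta[:n].T, theta[n:].T
        if spectral_radius(A_hat + B_hat @ K) < 1.0:   # safeguard (assumption)
            _, g = lqr_cost_and_grad(A_hat, B_hat, Q, R, K, Sigma0)
            g_norm = np.linalg.norm(g)
            if g_norm > max_grad:                       # clipping (assumption)
                g = g * (max_grad / g_norm)
            K = K - gamma * g
        x = x_next
    return K, theta
```

Consider a scalar plant with $A = 0.9$, $B = 1$, $Q = R = 1$ and $K_0 = 0$. For this case, `relearn_lqr_sketch` is expected to return a gain close to the Riccati solution from `lqr_gain_riccati` (about $-0.54$), together with an estimate close to $[A\ B]^\top$. The two safeguards only limit how far an early, poor model estimate can move $K$ before the estimator has seen enough data.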

Paper Structure (19 sections, 6 theorems, 86 equations, 8 figures, 1 algorithm)

Introduction
Preliminaries and Problem Setup
Preliminaries on averaging theory for two-time-scale systems
On-Policy Data-Driven LQR: Problem Setup
Preliminaries: Model-based Gradient Method for LQR
Model-based reduced problem formulation
Model-based gradient method for the LQR problem
On-policy LQR for Unknown Systems: Concurrent Learning and Optimization
Stability Analysis
Closed-Loop Dynamics in Error Coordinates
Averaged System Analysis
Proof of the Convergence Theorem
Numerical Simulations
Aircraft Control
Aircraft Control with Drifting Parameters
...and 4 more sections

Key Result

Theorem 2.6

[bai1988averaging] Consider system (eq:mixed_scale_system) and let Assumptions ass:lipschitz, ass:equilibrium, ass:schur, ass:limit, and ass:nu hold. If there exists $\epsilon_0 > 0$ such that, for all $\epsilon \in (0,\epsilon_0)$, the origin is exponentially stable for system (eq:average_app), then there …
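Theorem 2.6 is an averaging result for two-time-scale systems, and its statement is truncated above. The toy system below illustrates the underlying idea; it is not taken from the paper and is not checked against the listed assumptions. A slow state, updated with a small gain $\epsilon$, is coupled to a fast, contracting state driven by a periodic input. The fast state's steady-state oscillation has zero mean, so replacing it by its average gives the averaged system $x^{\mathrm{av}}_{t+1} = (1-\epsilon)x^{\mathrm{av}}_t$. The function name and all numbers are hypothetical.

```python
import numpy as np


def simulate_two_time_scale(eps, T=500, x0=1.0, xi0=0.0):
    """Toy slow/fast system and its averaged counterpart (illustration only).

    slow: x_{t+1}  = x_t - eps * (1 + xi_t) * x_t
    fast: xi_{t+1} = 0.5 * xi_t + cos(pi * t)   (contracting, periodic input)
    The fast state settles to the zero-mean orbit xi_t = -(2/3) * (-1)^t,
    so averaging it out gives x_{t+1} = x_t - eps * x_t.
    """
    x = np.empty(T + 1)
    x_av = np.empty(T + 1)
    x[0] = x_av[0] = x0
    xi = xi0
    for t in range(T):
        x[t + 1] = x[t] - eps * (1.0 + xi) * x[t]   # original slow dynamics
        xi = 0.5 * xi + np.cos(np.pi * t)          # fast dynamics
        x_av[t + 1] = x_av[t] - eps * x_av[t]      # averaged slow dynamics
    return x, x_av
```

Running it with $\epsilon = 0.01$ and $\epsilon = 0.001$ over the same horizon $\epsilon T = 5$ shows both trajectories decaying exponentially. The relative gap between the original and averaged trajectories also shrinks as $\epsilon$ decreases.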

Figures (8)

Figure 1: Schematic representation of the stability-certified on-policy LQR setup.
Figure 2: Representation of the concurrent learning and optimization scheme implemented by Relearn LQR.
Figure 3: Block diagram describing system (eq:mixed_scale_system_our).
Figure 4: Block diagram of system (eq:algo_closed_loop_average) with $\tilde{\mathrm{z}}^{\textsc{av}}_t = \mathop{\mathrm{col}}\nolimits(\tilde{\theta}^{\textsc{av}}_{t},\tilde{K}^{\textsc{av}}_{t})$.
Figure 5: (left) Evolution of the normalized cost error $|J(K_{t},\theta^\star_t) - J^\star|/J^\star$. (right) Evolution of the normalized estimation error $\|\theta_{t} - \theta^\star\|/\|\theta^\star\|$.
...and 3 more figures
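The two quantities plotted in Figure 5 can be computed with a short sketch. It assumes $u = Kx$ and $J(K,\theta) = \operatorname{tr}(P_K \Sigma_0)$, with $\Sigma_0 = I$ by default. It also assumes the Frobenius norm and the stacking $\theta = [A\ B]^\top$. These conventions, the function names, and the choice of passing the optimal gain `K_star` explicitly are not given in the caption. Evaluating the cost on the current true matrices also covers the drifting case $\theta^\star_t$.

```python
import numpy as np


def dlyap(A, Q):
    """Solve X = A X A^T + Q (A must be Schur) by vectorization."""
    n = A.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return x.reshape(n, n)


def lqr_cost(A, B, Q, R, K, Sigma0):
    """J(K, theta) = tr(P_K Sigma0) for u = K x; inf if A + B K is not Schur."""
    AK = A + B @ K
    if np.max(np.abs(np.linalg.eigvals(AK))) >= 1.0:
        return np.inf
    P = dlyap(AK.T, Q + K.T @ R @ K)
    return float(np.trace(P @ Sigma0))


def normalized_errors(K, theta_hat, A, B, Q, R, K_star, Sigma0=None):
    """Return (|J(K) - J*| / J*, ||theta_hat - theta*|| / ||theta*||)."""
    n = A.shape[0]
    Sigma0 = np.eye(n) if Sigma0 is None else Sigma0
    J = lqr_cost(A, B, Q, R, K, Sigma0)
    J_star = lqr_cost(A, B, Q, R, K_star, Sigma0)
    theta_star = np.vstack([A.T, B.T])    # assumed stacking [A B]^T
    cost_err = abs(J - J_star) / J_star
    est_err = np.linalg.norm(theta_hat - theta_star) / np.linalg.norm(theta_star)
    return cost_err, est_err
```

A destabilizing gain yields an infinite cost error, which matches the convention that the LQR cost is unbounded outside the set of stabilizing gains.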

Theorems & Definitions (10)

Theorem 2.6
Remark 2.8
Remark 3.2
Theorem 3.3
Remark 3.4
Lemma 4.1
Lemma 4.2
Proposition 4.3
Lemma B.1
Proof B.2
