Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR

Feiran Zhao; Florian Dörfler; Alessandro Chiuso; Keyou You

TL;DR

The paper tackles online adaptive learning of the LQR gain $K^*$ directly from closed-loop data, without explicit system identification. It introduces a covariance-based policy parameterization that yields a direct data-driven LQR problem equivalent to the certainty-equivalence LQR, and develops the data-enabled policy optimization (DeePO) method, which computes gradient updates from persistently exciting (PE) data and admits a recursive update with one gradient step per sample. A key contribution is the projected gradient dominance analysis, which guarantees global convergence of DeePO with offline data and yields non-asymptotic regret guarantees for online adaptive learning under bounded noise. Simulations demonstrate both computational and sample efficiency, showing sublinear regret and favorable comparisons against indirect adaptive control and zeroth-order policy optimization (PO). The framework lays groundwork for extensions to time-varying systems and other performance indices.

Abstract

Direct data-driven design methods for the linear quadratic regulator (LQR) mainly use offline or episodic data batches, and their online adaptation has been acknowledged as an open problem. In this paper, we propose a direct adaptive method to learn the LQR from online closed-loop data. First, we propose a new policy parameterization based on the sample covariance to formulate a direct data-driven LQR problem, which is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees. Second, we design a novel data-enabled policy optimization (DeePO) method to directly update the policy, where the gradient is explicitly computed using only a batch of persistently exciting (PE) data. Third, we establish its global convergence via a projected gradient dominance property. Importantly, we efficiently use DeePO to adaptively learn the LQR by performing only one-step projected gradient descent per sample of the closed-loop system, which also leads to an explicit recursive update of the policy. Under PE inputs and for bounded noise, we show that the average regret of the LQR cost is upper-bounded by two terms signifying a sublinear decrease in time $\mathcal{O}(1/\sqrt{T})$ plus a bias scaling inversely with signal-to-noise ratio (SNR), which are independent of the noise statistics. Finally, we perform simulations to validate the theoretical results and demonstrate the computational and sample efficiency of our method.

Paper Structure

This paper contains 37 sections, 18 theorems, 97 equations, 8 figures, 1 table, and 1 algorithm.

Introduction
Data-driven formulations and adaptive learning of the LQR
Model-based LQR
Indirect certainty-equivalence LQR with ordinary least-square identification
Direct LQR with data-based policy parameterization
PO of the LQR using zeroth-order gradient estimate
Direct adaptive learning for the LQR with online closed-loop data
Direct data-driven LQR with a new policy parameterization
A new policy parameterization using sample covariance
The equivalence between the covariance parameterization of the LQR and the indirect certainty-equivalence LQR
DeePO for the LQR with covariance parameterization using offline data
Data-enabled policy optimization to solve the LQR problem with covariance parameterization (prob:equiV)
Projected gradient dominance of the LQR cost
Global convergence of DeePO
DeePO for direct, adaptive, and recursive learning of the LQR with online closed-loop data
...and 22 more sections

Key Result

Lemma 1

The feasible sets of problems (prob:indirect) and (prob:equiV) coincide under the change of variables $V = \Phi^{-1}[K^{\top},I_n]^{\top}$, and $J^* = C_{\text{CE}}^*$.
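
The change of variables in Lemma 1 can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated on this page: $\Phi$ is taken to be the sample covariance of the stacked input-state data `D0 = [U0; X0]`, the certainty-equivalence model is the ordinary least-squares estimate, and the names `D0`, `X1_bar`, `A_hat`, `B_hat`, along with the randomly generated system, are hypothetical. Under those assumptions, the closed-loop matrix written in terms of $V$ should match the certainty-equivalence closed-loop matrix for any gain $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 3, 2, 50  # state dim, input dim, sample count (hypothetical values)

# Hypothetical system, used only to generate data; rescaled to be stable.
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((n, m))

# Collect one trajectory driven by random inputs and small process noise.
U0 = rng.standard_normal((m, t))
X = np.zeros((n, t + 1))
for k in range(t):
    w = 0.01 * rng.standard_normal(n)
    X[:, k + 1] = A @ X[:, k] + B @ U0[:, k] + w
X0, X1 = X[:, :t], X[:, 1:]

# Assumed definitions: stacked data, its sample covariance, and the
# cross-covariance between successor states and stacked data.
D0 = np.vstack([U0, X0])
Phi = D0 @ D0.T / t
X1_bar = X1 @ D0.T / t

# Certainty-equivalence model from ordinary least squares: [B_hat, A_hat].
BA_hat = X1 @ np.linalg.pinv(D0)
B_hat, A_hat = BA_hat[:, :m], BA_hat[:, m:]

# Change of variables from Lemma 1: V = Phi^{-1} [K; I_n] for an arbitrary gain K.
K = rng.standard_normal((m, n))
V = np.linalg.solve(Phi, np.vstack([K, np.eye(n)]))

# Closed-loop matrix written two ways; the gap should be at rounding level.
closed_loop_direct = X1_bar @ V
closed_loop_ce = A_hat + B_hat @ K
gap = np.max(np.abs(closed_loop_direct - closed_loop_ce))
print(f"max abs difference: {gap:.2e}")
```

The identity behind the check is $\overline{X}_1 V = \overline{X}_1 \Phi^{-1}[K^{\top}, I_n]^{\top} = [\hat{B}, \hat{A}][K^{\top}, I_n]^{\top} = \hat{A} + \hat{B}K$, which holds whenever $\Phi$ is invertible, that is, when the data are persistently exciting.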

Figures (8)

Figure 1: An illustration of episodic approaches, where $h^i=(x_0,u_0,\dots,x_{T^i})$ denotes the $i$-th episode of data, and the episodes can be consecutive.
Figure 2: An illustration of indirect and direct adaptive approaches in a closed-loop system, where function $f_t$ has an explicit form.
Figure 3: Convergence of DeePO for the LQR with offline data, where $J^*$ is obtained by solving the convex program of (prob:equiV).
Figure 4: Convergence of DeePO for adaptive learning of the LQR under different noise levels $\sigma$.
Figure 5: Comparison of DeePO and indirect methods for adaptive learning of the LQR. We also plot the optimality gap of the initial gain (dashed green line) computed by single-batch methods from offline data [de2021low, dorfler2021certainty, dorfler22on].
...and 3 more figures

Theorems & Definitions (24)

Remark 1
Remark 2
Lemma 1: Equivalence to the certainty-equivalence LQR
Remark 3: Implicit regularization
Remark 4
Lemma 2
Definition 1: Projected gradient dominance
Lemma 3: Projected gradient dominance of degree 1
Lemma 4: Local smoothness
Theorem 1: Global convergence
...and 14 more
