Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR

Feiran Zhao, Florian Dörfler, Alessandro Chiuso, Keyou You

TL;DR

The paper tackles online adaptive learning of the LQR gain $K^*$ directly from closed-loop data, without explicit system identification. It introduces a covariance-based policy parameterization that yields a direct data-driven LQR problem equivalent to the certainty-equivalence LQR, and develops the data-enabled policy optimization (DeePO) method, which computes gradient updates from persistently exciting (PE) data and admits a recursive update with one projected gradient step per sample. A key contribution is the projected gradient dominance analysis, which guarantees global convergence of DeePO with offline data and yields non-asymptotic regret guarantees for online adaptive learning under bounded noise. Simulations demonstrate both computational and sample efficiency, showing sublinear regret and favorable comparisons against indirect adaptive control and zeroth-order policy optimization (PO). The framework lays groundwork for extensions to time-varying systems and other performance indices.

Abstract

Direct data-driven design methods for the linear quadratic regulator (LQR) mainly use offline or episodic data batches, and their online adaptation has been acknowledged as an open problem. In this paper, we propose a direct adaptive method to learn the LQR from online closed-loop data. First, we propose a new policy parameterization based on the sample covariance to formulate a direct data-driven LQR problem, which is shown to be equivalent to the certainty-equivalence LQR with optimal non-asymptotic guarantees. Second, we design a novel data-enabled policy optimization (DeePO) method to directly update the policy, where the gradient is explicitly computed using only a batch of persistently exciting (PE) data. Third, we establish its global convergence via a projected gradient dominance property. Importantly, we efficiently use DeePO to adaptively learn the LQR by performing only one-step projected gradient descent per sample of the closed-loop system, which also leads to an explicit recursive update of the policy. Under PE inputs and for bounded noise, we show that the average regret of the LQR cost is upper-bounded by two terms signifying a sublinear decrease in time $\mathcal{O}(1/\sqrt{T})$ plus a bias scaling inversely with signal-to-noise ratio (SNR), which are independent of the noise statistics. Finally, we perform simulations to validate the theoretical results and demonstrate the computational and sample efficiency of our method.
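
The one-step projected gradient update described in the abstract can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the policy is parameterized as $[K^{\top}, I_n]^{\top} = \Phi V$ with $\Phi$ the sample covariance of the stacked input-state data, that the cost is the LQR cost of the data-based closed loop $\bar{X}_1 V$ with identity initial-state covariance, and that the projection is onto the nullspace of $\bar{X}_0$. The helper names (`dlyap`, `cost_and_grad`, `projected_step`), the example system, and the step size `eta` are all hypothetical.

```python
import numpy as np

def dlyap(L, Q):
    """Solve X = L X L^T + Q by vectorization (adequate for small matrices)."""
    k = L.shape[0]
    x = np.linalg.solve(np.eye(k * k) - np.kron(L, L), Q.reshape(-1))
    return x.reshape(k, k)

def cost_and_grad(V, U0bar, X1bar, Q, R):
    """Cost J(V) and gradient under the assumed covariance parameterization."""
    n = X1bar.shape[0]
    L = X1bar @ V                      # data-based closed-loop matrix
    M = U0bar.T @ R @ U0bar            # input penalty expressed in V
    Qc = Q + V.T @ M @ V               # stage cost matrix
    Sigma = dlyap(L, np.eye(n))        # Sigma = I + L Sigma L^T
    P = dlyap(L.T, Qc)                 # P = Qc + L^T P L
    J = np.trace(Qc @ Sigma)
    G = 2.0 * (M + X1bar.T @ P @ X1bar) @ V @ Sigma
    return J, G

def projected_step(V, X0bar, G, eta):
    """One gradient step projected onto the nullspace of X0bar."""
    Pi = np.eye(V.shape[0]) - np.linalg.pinv(X0bar) @ X0bar
    return V - eta * Pi @ G

# Hypothetical stable example system and noisy data (illustration only).
rng = np.random.default_rng(1)
n, m, T = 2, 1, 60
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
Q, R = np.eye(n), np.eye(m)

X = np.zeros((n, T + 1))
U = rng.standard_normal((m, T))        # persistently exciting inputs
W = 0.05 * rng.standard_normal((n, T)) # small process noise
for t in range(T):
    X[:, t + 1] = A @ X[:, t] + B @ U[:, t] + W[:, t]

D0 = np.vstack([U, X[:, :T]])          # stacked input-state data
Phi = D0 @ D0.T / T                    # sample covariance
U0bar, X0bar = Phi[:m, :], Phi[m:, :]
X1bar = X[:, 1:] @ D0.T / T

# Start from K = 0, which is stabilizing here because A is stable.
K0 = np.zeros((m, n))
V0 = np.linalg.solve(Phi, np.vstack([K0, np.eye(n)]))

eta = 1e-3                             # hypothetical step size
J0, G0 = cost_and_grad(V0, U0bar, X1bar, Q, R)
V1 = projected_step(V0, X0bar, G0, eta)
J1, _ = cost_and_grad(V1, U0bar, X1bar, Q, R)
K1 = U0bar @ V1                        # gain recovered from V
```

The projection keeps the constraint $\bar{X}_0 V = I_n$ satisfied after the step, so the updated $V$ still corresponds to a state-feedback gain $K = \bar{U}_0 V$. In an adaptive loop, the data matrices would be refreshed with each new sample before the next step.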

Paper Structure

This paper contains 37 sections, 18 theorems, 97 equations, 8 figures, 1 table, and 1 algorithm.

Key Result

Lemma 1

The feasible sets of problems (prob:indirect) and (prob:equiV) coincide under the change of variables $V = \Phi^{-1}[K^{\top},I_n]^{\top}$, and their optimal values agree: $J^* = C_{\text{CE}}^*$.
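
The change of variables in Lemma 1 can be checked numerically. The sketch below assumes that $\Phi$ is the sample covariance of the stacked input-state data and that the certainty-equivalence model is the ordinary least-squares estimate of $[B, A]$; neither definition appears in this summary, so both are labeled assumptions, and the example system is hypothetical. Under these assumptions, the closed-loop matrix $\hat{A} + \hat{B}K$ of the estimated model equals the data-based expression $\bar{X}_1 V$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 40

# Hypothetical system, used only to generate data.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])

X = np.zeros((n, T + 1))
U = rng.standard_normal((m, T))
W = 0.01 * rng.standard_normal((n, T))
for t in range(T):
    X[:, t + 1] = A @ X[:, t] + B @ U[:, t] + W[:, t]

X0, X1 = X[:, :T], X[:, 1:]
D0 = np.vstack([U, X0])          # stacked input-state data
Phi = D0 @ D0.T / T              # assumed definition of Phi
X1bar = X1 @ D0.T / T

# Least-squares (certainty-equivalence) estimate of [B, A].
BA_hat = np.linalg.lstsq(D0.T, X1.T, rcond=None)[0].T
B_hat, A_hat = BA_hat[:, :m], BA_hat[:, m:]

# Change of variables V = Phi^{-1} [K^T, I_n]^T for an arbitrary gain K.
K = rng.standard_normal((m, n))
V = np.linalg.solve(Phi, np.vstack([K, np.eye(n)]))

closed_loop_ce = A_hat + B_hat @ K   # via the estimated model
closed_loop_data = X1bar @ V         # via the covariance parameterization
```

Because the two closed-loop matrices coincide for every $K$, any cost that depends on the gain only through the closed loop and $K$ takes the same value in both formulations, which is consistent with the equivalence the lemma states.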

Figures (8)

  • Figure 1: An illustration of episodic approaches, where $h^i=(x_0,u_0,\dots,x_{T^i})$ denotes the $i$-th episode of data, and the episodes can be consecutive.
  • Figure 2: An illustration of indirect and direct adaptive approaches in a closed-loop system, where function $f_t$ has an explicit form.
  • Figure 3: Convergence of DeePO for the LQR with offline data, where $J^*$ is obtained by solving the convex program (prob:equiV).
  • Figure 4: Convergence of DeePO for adaptive learning of the LQR under different noise levels $\sigma$.
  • Figure 5: Comparison of DeePO and indirect methods for adaptive learning of the LQR. We also plot the optimality gap of the initial gain (dashed green line) computed by single-batch methods from offline data [de2021low, dorfler2021certainty, dorfler22on].
  • ...and 3 more figures

Theorems & Definitions (24)

  • Remark 1
  • Remark 2
  • Lemma 1: Equivalence to the certainty-equivalence LQR
  • Remark 3: Implicit regularization
  • Remark 4
  • Lemma 2
  • Definition 1: Projected gradient dominance
  • Lemma 3: Projected gradient dominance of degree 1
  • Lemma 4: Local smoothness
  • Theorem 1: Global convergence
  • ...and 14 more