A model-free first-order method for linear quadratic regulator with $\tilde{O}(1/\varepsilon)$ sampling complexity

Caleb Ju, Georgios Kotsalis, Guanghui Lan

TL;DR

This work tackles online, model-free linear quadratic regulation with stochastic disturbances by developing an actor–critic framework that combines a natural policy gradient method (actor) with a shrinking multi-epoch conditional stochastic primal-dual (CSPD) method (critic). The key innovation is showing that the accuracy scales with the variance, rather than the standard deviation, of the gradient estimation error, which yields an overall sampling complexity of $\tilde{O}(1/\varepsilon)$ to reach an $\varepsilon$-suboptimal policy under Markovian noise. The critic solves a min–max Bellman residual problem via CSPD using online data from a single ergodic trajectory, while the actor maintains stability and converges linearly under a gradient-domination (PŁ) condition. Overall, the algorithm matches the best-known model-based rate up to logarithmic factors and improves on prior model-free results by dropping the assumption that all iterates are stable almost surely and by operating online without resets. It is evaluated empirically on synthetic and aerospace control tasks.
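
A heuristic way to read the variance-versus-standard-deviation claim, not the paper's proof: assume (our assumption, not stated in this summary) that a gradient estimate built from $N$ samples has error variance $\sigma^2 = \tilde{O}(1/N)$. Then the two kinds of dependence give different sample counts for the same target accuracy $\varepsilon$:

$$
\text{gap} \lesssim \sigma \le \varepsilon \;\Longrightarrow\; N = \tilde{O}(1/\varepsilon^{2}),
\qquad
\text{gap} \lesssim \sigma^{2} \le \varepsilon \;\Longrightarrow\; N = \tilde{O}(1/\varepsilon).
$$

Under linear convergence of the actor, the number of outer iterations is only logarithmic in $1/\varepsilon$, so in this rough accounting it is absorbed into $\tilde{O}$.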

Abstract

We consider the classic stochastic linear quadratic regulator (LQR) problem under an infinite horizon average stage cost. By leveraging recent policy gradient methods from reinforcement learning, we obtain a first-order method that finds a stable feedback law whose objective function gap to the optimum is at most $\varepsilon$ with high probability using $\tilde{O}(1/\varepsilon)$ samples, where $\tilde{O}$ hides polylogarithmic dependence on $\varepsilon$. Our proposed method seems to have the best dependence on $\varepsilon$ within the model-free literature without the assumption that all policies generated by the algorithm are stable almost surely, and it matches the best-known rate from the model-based literature, up to logarithmic factors. The improved dependence on $\varepsilon$ is achieved by showing the accuracy scales with the variance rather than the standard deviation of the gradient estimation error. Our developments that result in this improved sampling complexity fall in the category of actor-critic algorithms. The actor part involves a variational inequality formulation of the stochastic LQR problem, while in the critic part, we utilize a conditional stochastic primal-dual method and show that the algorithm has the optimal rate of convergence when paired with a shrinking multi-epoch scheme.
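
To make the objects in the abstract concrete, here is a minimal model-based sketch. It is not the paper's model-free algorithm. It assumes dynamics $x_{t+1} = Ax_t + Bu_t + w_t$ with feedback $u_t = -Kx_t$, stage cost $x^\top Q x + u^\top R u$, and noise covariance `W`. These symbols, the helper names, and the exact (known-model) natural-gradient-style step with step size $1/\|R + B^\top P_K B\|$ are illustrative assumptions, not taken from the source.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude; K is stabilizing when rho(A - BK) < 1."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def value_matrix(A, B, Q, R, K):
    """Solve P = Q + K^T R K + (A-BK)^T P (A-BK) for a stabilizing K."""
    Acl = A - B @ K
    n = A.shape[0]
    QK = Q + K.T @ R @ K
    # vec(Acl^T P Acl) = (Acl^T kron Acl^T) vec(P), so the equation is linear in vec(P).
    lhs = np.eye(n * n) - np.kron(Acl.T, Acl.T)
    P = np.linalg.solve(lhs, QK.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

def average_cost(A, B, Q, R, W, K):
    """Average stage cost J(K) = trace(P_K W); infinite if K is not stabilizing."""
    if spectral_radius(A - B @ K) >= 1.0:
        return np.inf
    return float(np.trace(value_matrix(A, B, Q, R, K) @ W))

def exact_npg_step(A, B, Q, R, K):
    """One known-model step K - eta * E_K, E_K = (R + B^T P B) K - B^T P A."""
    P = value_matrix(A, B, Q, R, K)
    H = R + B.T @ P @ B
    E = H @ K - B.T @ P @ A
    eta = 1.0 / np.linalg.norm(H, 2)  # assumed step size for this sketch
    return K - eta * E

if __name__ == "__main__":
    # Hypothetical scalar system, chosen only for illustration.
    A = np.array([[1.0]]); B = np.array([[1.0]])
    Q = np.array([[1.0]]); R = np.array([[1.0]]); W = np.array([[1.0]])
    K = np.array([[0.5]])
    for t in range(4):
        print(t, K[0, 0], average_cost(A, B, Q, R, W, K))
        K = exact_npg_step(A, B, Q, R, K)
```

In the model-free setting described above, `value_matrix` is not available because `A` and `B` are unknown; the critic's role is to estimate the quantities needed for the actor's update from trajectory data.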

Paper Structure (17 sections, 23 theorems, 71 equations, 1 figure, 3 algorithms)

This paper contains 17 sections, 23 theorems, 71 equations, 1 figure, 3 algorithms.

Key Result

Lemma 2.1

Let $\rho \equiv \rho(A-BK)$. If $K \in \mathcal{S}$, then

Figures (1)

  • Figure 1: The line and shaded region are the median and confidence interval, respectively, of the function gap $J(K_t)-J(K^*)$ (on a log scale) w.r.t. the total samples over 32 seeds. The function gap is shown only when a majority (i.e., $\geq 60\%$) of the seeds have a stable policy. Since the two-timescale AC does not have a stable policy in a majority of seeds after a few hundred samples in the simple environments, we add zoomed-in plots to better display its performance. In contrast, both of our NPG methods output stable policies in every seed. The median last-iterate performance $J(K_t)$ is in the legend. The optimal $J(K^*)$ (left to right) is approximately 0.28, 0.93, 11200.
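
The aggregation described in the Figure 1 caption could be sketched as follows. The caption does not say how the confidence interval is computed or how unstable seeds enter the statistics, so this sketch assumes a percentile band and assumes that unstable seeds are excluded at each checkpoint. The function name `summarize_gap` and its parameters are hypothetical.

```python
import numpy as np

def summarize_gap(gaps, stable, min_stable_frac=0.6, band=(25, 75)):
    """Median and percentile band of the function gap across seeds.

    gaps, stable: arrays of shape (num_seeds, num_checkpoints).
    A checkpoint is reported only if at least `min_stable_frac` of the
    seeds have a stable policy there; otherwise the entries stay NaN.
    """
    gaps = np.asarray(gaps, dtype=float)
    stable = np.asarray(stable, dtype=bool)
    num_checkpoints = gaps.shape[1]
    frac_stable = stable.mean(axis=0)

    med = np.full(num_checkpoints, np.nan)
    lo = np.full(num_checkpoints, np.nan)
    hi = np.full(num_checkpoints, np.nan)
    for t in range(num_checkpoints):
        if frac_stable[t] >= min_stable_frac:
            vals = gaps[stable[:, t], t]  # assumption: drop unstable seeds
            med[t] = np.median(vals)
            lo[t], hi[t] = np.percentile(vals, band)
    return med, lo, hi
```

Plotting `med` on a log-scaled axis with the region between `lo` and `hi` shaded would give a curve of the kind the caption describes, with gaps wherever too few seeds are stable.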

Theorems & Definitions (35)

  • Lemma 2.1
  • Lemma 2.2
  • Lemma 2.3
  • Lemma 2.4
  • Proposition 3.1
  • Proof 1
  • Lemma 3.2
  • Lemma 3.3
  • Proposition 3.4
  • Proof 2
  • ...and 25 more