Revisiting LQR Control from the Perspective of Receding-Horizon Policy Gradient

Xiangyuan Zhang, Tamer Başar

TL;DR

The paper addresses learning optimal LQR controllers in a model-free setting by embedding Bellman optimality into a receding-horizon policy gradient (RHPG) framework. It shows that finite-horizon subproblems, each solved via zeroth-order policy gradient, can be solved backward in time to yield a policy that is $\epsilon$-close to the infinite-horizon optimum, without requiring an initial stabilizing policy. A key contribution is a fine-grained sample-complexity analysis showing that each subproblem converges in $\mathcal{O}(\epsilon^{-2} \log(1/\delta))$ steps and that the total complexity scales polylogarithmically with the horizon and linearly with $1/\delta$, while guaranteeing stability for sufficiently small $\epsilon$. The results combine dynamic programming (DP) principles with model-free learning for both control and estimation tasks, with extensions to stochastic LQR and arbitrary initial states, and they target data-driven linear control with performance guarantees.
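The backward-in-time procedure summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the plant `A`, `B`, the weights `Q`, `R`, the terminal cost $Q_N = Q$, the step size, the smoothing radius, and the iteration counts are all hypothetical choices, and costs are evaluated exactly over unit-vector initial states rather than from noisy samples. The learner touches the plant only through `rollout_cost`; the Riccati recursion at the end is a model-based check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical plant (open-loop unstable); the learner only sees rollout costs.
A = np.array([[1.1, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = B.shape

def rollout_cost(gains, h, N):
    """Cost from time h to N under u_t = -gains[t] @ x_t,
    summed over the n unit-vector initial states x_h = e_i."""
    X = np.eye(n)                            # columns are initial states
    cost = 0.0
    for t in range(h, N):
        U = -gains[t] @ X
        cost += np.sum(X * (Q @ X)) + np.sum(U * (R @ U))
        X = A @ X + B @ U
    return cost + np.sum(X * (Q @ X))        # terminal cost, Q_N = Q (assumed)

def rhpg(N=20, iters=100, step=0.05, radius=0.1):
    """Solve N one-gain subproblems backward in time with a
    two-point zeroth-order gradient estimate."""
    gains = [np.zeros((m, n)) for _ in range(N)]
    for h in reversed(range(N)):
        K = np.zeros((m, n))                 # no stabilizing initial gain needed
        for _ in range(iters):
            D = rng.standard_normal((m, n))
            D /= np.linalg.norm(D)           # random direction on the unit sphere
            gains[h] = K + radius * D
            c_plus = rollout_cost(gains, h, N)
            gains[h] = K - radius * D
            c_minus = rollout_cost(gains, h, N)
            grad = (m * n) * (c_plus - c_minus) / (2 * radius) * D
            K = K - step * grad
        gains[h] = K                         # freeze the learned gain for time h
    return gains[0]

def riccati_gain(steps=500):
    """Model-based reference gain, used only to check the learned policy."""
    P = Q.copy()
    for _ in range(steps):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

K_rhpg = rhpg()
K_ref = riccati_gain()
print("policy error:", np.linalg.norm(K_rhpg - K_ref))
print("closed-loop spectral radius:", max(abs(np.linalg.eigvals(A - B @ K_rhpg))))
```

Each subproblem is a strongly convex quadratic in its own gain once the later gains are frozen, which is why a plain zero initialization suffices here even though the open-loop plant is unstable.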

Abstract

We revisit in this paper the discrete-time linear quadratic regulator (LQR) problem from the perspective of receding-horizon policy gradient (RHPG), a newly developed model-free learning framework for control applications. We provide a fine-grained sample complexity analysis for RHPG to learn a control policy that is both stabilizing and $ε$-close to the optimal LQR solution, and our algorithm does not require knowing a stabilizing control policy for initialization. Combined with the recent application of RHPG in learning the Kalman filter, we demonstrate the general applicability of RHPG in linear control and estimation with streamlined analyses.

Paper Structure

This paper contains 20 sections, 4 theorems, 60 equations, 2 figures, 1 algorithm.

Key Result

Theorem 3.1

Let $A_K^* := A - BK^*$, let $\|\cdot\|_*$ denote the $P^*$-induced norm, and define the horizon threshold $N_0$ (its defining expression is omitted here), where it holds that $\|A_K^*\|_* < 1$. Then, for any $\epsilon > 0$ and all $N \geq N_0$, the control policy $K^*_{0}$ computed by the finite-horizon LQR gain recursion (eqn:finite_lqr_gain) satisfies $\|K^*_{0} - K^*\| \leq \epsilon$.
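The convergence of the finite-horizon gain $K^*_0$ to $K^*$ as $N$ grows can be observed numerically with the standard backward Riccati recursion. The sketch below assumes a hypothetical two-state plant, the control law $u_t = -K_t x_t$ (matching $A_K^* = A - BK^*$), and a terminal cost equal to `Q`; none of these values come from the paper, and `K_star` is approximated by a very long horizon.

```python
import numpy as np

def finite_horizon_gain(A, B, Q, R, N):
    """Run N backward Riccati steps from P_N = Q and return K_0."""
    P = Q.copy()
    K = None
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        A_cl = A - B @ K
        P = Q + K.T @ R @ K + A_cl.T @ P @ A_cl
    return K

# Hypothetical open-loop unstable plant, chosen for illustration only.
A = np.array([[1.1, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# Long-horizon gain used as a stand-in for the infinite-horizon K^*.
K_star = finite_horizon_gain(A, B, Q, R, N=2000)

errors = {
    N: np.linalg.norm(finite_horizon_gain(A, B, Q, R, N) - K_star)
    for N in (1, 5, 20, 50, 200)
}
for N, err in errors.items():
    print(f"N = {N:4d}   ||K_0 - K*|| = {err:.3e}")
```

The printed errors shrink as the horizon grows, which is the qualitative behavior the theorem quantifies through $N_0$.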

Figures (2)

  • Figure 1: The output policy $\widetilde{K}_{0}$ is shown to be $\epsilon$-close to $K^*$ in two steps. First, Theorem 3.1 proves that $K^*_{0}$ is $\epsilon$-close to $K^*$ when $N$ is selected accordingly. Second, Theorem 3.2 analyzes the backward propagation of the computational errors from solving each subproblem, denoted $\delta_t:=\widetilde{K}_t - \widetilde{K}^*_t$ for all $t$, where $\widetilde{K}^*_t$ is the optimal LQR policy at step $t$ after absorbing the errors from all previous iterations. Finally, if the required optimality gap $\epsilon$ between $\widetilde{K}_0$ and $K^*$ is small enough, the RHPG output $\widetilde{K}_0$ automatically acquires a closed-loop stability certificate.
  • Figure 2: For six different values of $\epsilon$: Left: policy error between the output and $K^*$. Right: the total number of calls to the (two-point) zeroth-order oracle.
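
The two-step argument in the Figure 1 caption amounts to a triangle inequality. The display below is a sketch in the caption's notation; the even split of the error budget into $\epsilon/2$ per term is an assumption made for illustration and is not taken from the paper.

$$
\|\widetilde{K}_0 - K^*\|
\;\leq\;
\underbrace{\|\widetilde{K}_0 - K^*_0\|}_{\text{accumulated subproblem errors}}
\;+\;
\underbrace{\|K^*_0 - K^*\|}_{\text{finite-horizon truncation}}
\;\leq\; \frac{\epsilon}{2} + \frac{\epsilon}{2} \;=\; \epsilon .
$$

The second term is controlled by choosing the horizon $N$ large enough, and the first by solving each subproblem accurately enough that the errors $\delta_t$ propagated backward in time stay small.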

Theorems & Definitions (5)

  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3
  • Proposition 3.4
  • Lemma 1.1