Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret

Yeoneung Kim, Gihun Kim, Jiwhan Park, Insoon Yang

TL;DR

This work tackles online learning of linear quadratic regulators (LQR) under parameter uncertainty by combining Thompson sampling with preconditioned Langevin dynamics. The core idea is to sample model parameters from an approximate posterior via a preconditioned unadjusted Langevin algorithm (ULA), while injecting a simple end-of-episode excitation that progressively improves posterior concentration through the preconditioner. The authors prove concentration results for both exact and approximate posteriors, establish polynomially bounded state norms, and derive a Bayesian regret bound of $O(\sqrt{T})$ under the relaxed assumption that the noise distribution is strongly log-concave. The method accelerates posterior sampling and relaxes the Gaussian-conjugacy requirement of previous work, with practical implications for data-efficient online LQR learning. Overall, the paper provides a principled framework for Bayesian online control with $O(\sqrt{T})$ regret guarantees under broad noise models.
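To make the sampling step concrete, here is a minimal sketch of a preconditioned ULA update, $\theta \leftarrow \theta - \gamma P^{-1} \nabla U(\theta) + \sqrt{2\gamma}\, P^{-1/2} \xi$, on a scalar system $x_{t+1} = a x_t + b u_t + w_t$ with $\theta = (a, b)$. Everything specific here is an assumption made for illustration and is not taken from the paper: Gaussian noise stands in for a general strongly log-concave density, inputs are drawn at random rather than from a controller, and the names `simulate` and `preconditioned_ula`, the step size, the regularizer `lam`, and the choice $P = \lambda I + \sum_t z_t z_t^\top$ with $z_t = (x_t, u_t)$ are all hypothetical.

```python
import math
import random


def inv2(m):
    """Inverse of a symmetric 2x2 matrix ((a, b), (b, d))."""
    (a, b), (_, d) = m
    det = a * d - b * b
    return ((d / det, -b / det), (-b / det, a / det))


def chol2(m):
    """Lower-triangular Cholesky factor L of an SPD 2x2 matrix, so L L^T = m."""
    (a, b), (_, d) = m
    l11 = math.sqrt(a)
    l21 = b / l11
    return ((l11, 0.0), (l21, math.sqrt(d - l21 * l21)))


def matvec(m, v):
    """Product of a 2x2 matrix with a length-2 vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])


def simulate(a, b, n, rng):
    """Roll out x_{t+1} = a x_t + b u_t + w_t with random inputs (illustrative only)."""
    data, x = [], 0.0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        x_next = a * x + b * u + rng.gauss(0.0, 1.0)
        data.append(((x, u), x_next))
        x = x_next
    return data


def preconditioned_ula(data, lam=1.0, step=0.1, n_steps=200, rng=None):
    """Draw one approximate posterior sample of theta = (a, b)."""
    if rng is None:
        rng = random.Random(0)
    # Assumed preconditioner: P = lam * I + sum_t z_t z_t^T.
    sxx = lam + sum(z[0] * z[0] for z, _ in data)
    sxu = sum(z[0] * z[1] for z, _ in data)
    suu = lam + sum(z[1] * z[1] for z, _ in data)
    precond = ((sxx, sxu), (sxu, suu))
    p_inv = inv2(precond)
    p_inv_half = chol2(p_inv)  # factor L with L L^T = P^{-1}
    scale = math.sqrt(2.0 * step)
    theta = (0.0, 0.0)
    for _ in range(n_steps):
        # Gradient of U(theta) = lam/2 |theta|^2 + 1/2 sum_t (y_t - theta.z_t)^2.
        g0, g1 = lam * theta[0], lam * theta[1]
        for z, y in data:
            resid = y - (theta[0] * z[0] + theta[1] * z[1])
            g0 -= resid * z[0]
            g1 -= resid * z[1]
        drift = matvec(p_inv, (g0, g1))
        noise = matvec(p_inv_half, (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        theta = (theta[0] - step * drift[0] + scale * noise[0],
                 theta[1] - step * drift[1] + scale * noise[1])
    return theta, precond


rng = random.Random(0)
data = simulate(a=0.9, b=0.5, n=300, rng=rng)
theta_sample, precond = preconditioned_ula(data, rng=rng)
```

With the Gaussian stand-in, the preconditioned drift reduces to a contraction toward the ridge estimate $P^{-1} \sum_t z_t y_t$ at a rate that does not depend on the conditioning of $P$, which is one way to see why a well-chosen preconditioner can speed up sampling.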

Abstract

We propose a novel Thompson sampling algorithm that learns linear quadratic regulators (LQR) with a Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a carefully designed preconditioner and incorporates a simple excitation mechanism. We show that the excitation signal drives the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Furthermore, we establish nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without relying on the restrictive assumptions that are often used in the literature.
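The effect of excitation on the preconditioner can be illustrated with a small sketch. It assumes a scalar system under a fixed stabilizing feedback gain, a preconditioner of the form $\lambda I + \sum_t z_t z_t^\top$ with $z_t = (x_t, u_t)$, and a unit-variance Gaussian perturbation added to the control at the last step of each episode. The function names, the gain, the episode length, and the perturbation scale are all hypothetical choices, not the paper's. Without the perturbation every $z_t$ lies on the line $u = Kx$, so the smallest eigenvalue stays at $\lambda$.

```python
import math
import random


def min_eig_sym2(sxx, sxu, suu):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[sxx, sxu], [sxu, suu]]."""
    half_trace = 0.5 * (sxx + suu)
    radius = math.sqrt((0.5 * (sxx - suu)) ** 2 + sxu ** 2)
    return half_trace - radius


def min_eig_history(excite, a=1.2, b=1.0, gain=-0.8, lam=1.0,
                    n_episodes=50, episode_len=10, seed=0):
    """Track the smallest eigenvalue of lam*I + sum_t z_t z_t^T after each episode."""
    rng = random.Random(seed)
    sxx, sxu, suu = lam, 0.0, lam
    x = 0.0
    history = []
    for _ in range(n_episodes):
        for t in range(episode_len):
            u = gain * x  # fixed feedback; closed loop a + b*gain = 0.4
            if excite and t == episode_len - 1:
                u += rng.gauss(0.0, 1.0)  # assumed end-of-episode excitation
            sxx += x * x
            sxu += x * u
            suu += u * u
            x = a * x + b * u + rng.gauss(0.0, 1.0)
        history.append(min_eig_sym2(sxx, sxu, suu))
    return history


plain = min_eig_history(excite=False)
excited = min_eig_history(excite=True)
```

Comparing `plain[-1]` with `excited[-1]` shows the gap: the excited run accumulates information in the direction orthogonal to $(1, K)$, while the unexcited run does not.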

Paper Structure

This paper contains 38 sections, 23 theorems, 287 equations, 8 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 2.2

Suppose that $(A, B)$ is stabilizable, and $(A, Q^{1/2})$ is observable. Then, the following algebraic Riccati equation (ARE) has a unique positive definite solution $P^*(\theta)$: Furthermore, the optimal cost function is given by $J (\theta) = \mathrm{tr} (\mathbf{W} P^*(\theta))$, which is continuously differentiable with respect to $\theta$, and the optimal policy is uniquely obtained as $\pi
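For orientation, the textbook discrete-time ARE for a stage cost $x^\top Q x + u^\top R u$ reads $P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A$, with optimal gain $K = -(R + B^\top P B)^{-1} B^\top P A$. This is the standard form and is assumed here; the excerpt above does not show the paper's own equation or name $R$. The sketch below solves the scalar case by fixed-point iteration and evaluates $J = \mathrm{tr}(\mathbf{W} P^*)$, which reduces to the noise variance times $P^*$. The function names and the numerical values are hypothetical.

```python
def solve_scalar_dare(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration on the scalar discrete-time Riccati equation."""
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def optimal_gain(a, b, r, p):
    """Scalar version of K = -(R + B'PB)^{-1} B'PA, so that u = K x."""
    return -(a * b * p) / (r + b * b * p)


# Illustrative values: an unstable open loop (a > 1) that is stabilizable (b != 0).
a, b, q, r, w_var = 1.2, 1.0, 1.0, 1.0, 0.5
p_star = solve_scalar_dare(a, b, q, r)
k_star = optimal_gain(a, b, r, p_star)
j_star = w_var * p_star  # scalar analogue of tr(W P*)

# Residual of the Riccati equation at the computed solution.
residual = p_star - (q + a * a * p_star
                     - (a * b * p_star) ** 2 / (r + b * b * p_star))
```

For these values the closed-loop coefficient `a + b * k_star` has magnitude below one, which is the scalar counterpart of the stabilizing property of the optimal policy.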

Figures (8)

  • Figure 1: Infusing noise for enhanced exploration
  • Figure 2: Flow chart of our theoretical results.
  • Figure 3: Filtration and measurability of $(y_s)$ and $(L_s)$.
  • Figure 4: Expected cumulative regret $R(T)$ over a time horizon $T$ using the Gaussian mixture noise for $n=n_u=3$ (left), for $n=n_u=5$ (center), for $n=n_u=10$ (right).
  • Figure 5: System parameter error $|\tilde{\theta}_k-\theta_*| / |\theta_*|$ over episode $k$ using the Gaussian mixture noise for $n=n_u=3$ (left), for $n=n_u=5$ (center), for $n=n_u=10$ (right).
  • ...and 3 more figures

Theorems & Definitions (45)

  • Theorem 2.2
  • Remark 2.3
  • Theorem 2.4
  • Lemma 3.1
  • Remark 3.2
  • Proposition 4.1
  • Proposition 4.2
  • Theorem 4.3
  • Proposition 4.4
  • Theorem 4.5
  • ...and 35 more