Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret
Yeoneung Kim, Gihun Kim, Jiwhan Park, Insoon Yang
TL;DR
This work addresses online learning of linear quadratic regulators under parameter uncertainty by combining Thompson sampling with preconditioned Langevin dynamics. The core idea is to sample model parameters from an approximate posterior using a preconditioned unadjusted Langevin algorithm (ULA), while injecting a simple excitation signal at the end of each episode so that the preconditioner grows and the posterior concentrates progressively. The authors prove concentration results for both the exact and the approximate posteriors, show that the state norms are polynomially bounded, and derive a Bayesian regret bound of $O(\sqrt{T})$. The analysis assumes only that the noise distribution is strongly log-concave, which relaxes the Gaussian-conjugacy requirement of earlier work. The growing preconditioner also speeds up approximate posterior sampling. Overall, the paper provides a principled framework for Bayesian online LQR learning with an $O(\sqrt{T})$ regret guarantee under a broader class of noise models.
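To make the sampling step concrete, the following is a minimal sketch of a generic preconditioned ULA iteration. It is not the authors' algorithm: the Gaussian toy target, the choice of the precision matrix `P` as the preconditioner, the step size, the iteration count, and the function name `preconditioned_ula` are all assumptions made for illustration.

```python
import numpy as np

def preconditioned_ula(grad_U, P, theta0, step, n_steps, rng):
    """Run n_steps of preconditioned ULA and return the final iterate.

    Update rule (generic form, assumed here):
        theta <- theta - step * P^{-1} grad_U(theta) + sqrt(2 * step) * P^{-1/2} xi
    where xi is standard Gaussian noise and P is symmetric positive definite.
    """
    # Symmetric eigendecomposition yields both P^{-1} and P^{-1/2}.
    evals, evecs = np.linalg.eigh(P)
    P_inv = (evecs / evals) @ evecs.T
    P_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T

    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(theta.shape)
        theta = (theta
                 - step * (P_inv @ grad_U(theta))
                 + np.sqrt(2.0 * step) * (P_inv_sqrt @ xi))
    return theta

# Toy target (an assumption for illustration): Gaussian with mean mu and
# precision P, i.e. U(theta) = 0.5 * (theta - mu)^T P (theta - mu).
P = np.array([[4.0, 1.0], [1.0, 3.0]])
mu = np.array([1.0, -2.0])
grad_U = lambda theta: P @ (theta - mu)

# Draw 300 approximate samples, each from an independent short chain.
rng = np.random.default_rng(0)
samples = np.array([
    preconditioned_ula(grad_U, P, np.zeros(2), step=0.1, n_steps=100, rng=rng)
    for _ in range(300)
])
print(samples.mean(axis=0))  # should lie close to mu
```

For this Gaussian toy target, preconditioning with the precision matrix makes the drift contract toward `mu` by a factor of `1 - step` per iteration, independent of the conditioning of `P`. This is the intuition behind using a preconditioner, although the paper's actual construction and guarantees are more general.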
Abstract
We propose a novel Thompson sampling algorithm that learns linear quadratic regulators (LQR) with a Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a carefully designed preconditioner and incorporates a simple excitation mechanism. We show that the excitation signal drives the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Furthermore, we establish nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without relying on the restrictive assumptions that are often used in the literature.
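The effect of excitation on the preconditioner can be illustrated with a small simulation. This is a sketch under assumptions that do not come from the paper: the system matrices, the fixed feedback gain `K`, the Gaussian perturbation, and the preconditioner form $I + \sum_t z_t z_t^\top$ with $z_t = (x_t, u_t)$ are all hypothetical. The gain is held fixed on purpose, to isolate the effect of the perturbation.

```python
import numpy as np

def min_eig_trace(excite, n_episodes=50, episode_len=10, seed=0):
    """Track the minimum eigenvalue of V = I + sum_t z_t z_t^T per episode."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.9, 0.2], [0.0, 0.8]])  # hypothetical dynamics
    B = np.array([[0.0], [1.0]])
    K = np.array([[-0.1, -0.3]])            # fixed gain, held constant on purpose
    n, m = B.shape

    V = np.eye(n + m)                       # assumed preconditioner form
    x = np.zeros(n)
    trace = []
    for _ in range(n_episodes):
        for t in range(episode_len):
            u = K @ x
            if excite and t == episode_len - 1:
                # Perturb the input only at the last step of the episode.
                u = u + rng.standard_normal(m)
            z = np.concatenate([x, u])
            V += np.outer(z, z)
            x = A @ x + B @ u + 0.1 * rng.standard_normal(n)
        trace.append(np.linalg.eigvalsh(V)[0])  # eigvalsh sorts ascending
    return np.array(trace)

trace_plain = min_eig_trace(excite=False)
trace_excited = min_eig_trace(excite=True)
print(trace_plain[-1], trace_excited[-1])
```

Without the perturbation, every $z_t = (x_t, K x_t)$ lies in a two-dimensional subspace of $\mathbb{R}^3$, so the minimum eigenvalue of `V` stays at its initial value of 1. With the perturbation, the data span all directions and the minimum eigenvalue grows.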
