Any-Time Regret-Guaranteed Algorithm for Control of Linear Quadratic Systems
Jafar Abbaszadeh Chekan, Cedric Langbort
TL;DR
This work develops anytime regret guarantees for learning-based LQR control with unknown dynamics by embedding SDP-based policy design inside an optimism-in-the-face-of-uncertainty (OFU) framework and injecting carefully scaled input perturbations. It introduces two algorithmic variants: ARSLO, which enforces strong sequential stability, and ARSLO$^+(\bar{\rho})$, which relaxes this notion using a dwell-time-inspired update rule to improve regret while preserving high-probability state bounds. A warm-up phase eliminates the need for a priori bounds on the DARE solution J_∗, and the analysis provides explicit state-norm bounds and system-theoretic regret guarantees that depend on the DARE solution P_∗ and system dimensions. Collectively, the paper advances convex, computationally efficient OFU-based LQR control with anytime guarantees, without requiring prior knowledge of J_∗, and clarifies the trade-offs between stability constraints and regret performance.
Abstract
We propose a computationally efficient algorithm that achieves anytime regret of order $\mathcal{O}(\sqrt{t})$, with explicit dependence on the system dimensions and on the solution of the Discrete Algebraic Riccati Equation (DARE). Our approach uses an appropriately tuned regularization and a sufficiently accurate initial estimate to construct confidence ellipsoids for control design. A carefully designed input-perturbation mechanism is incorporated to ensure anytime performance. We develop two variants of the algorithm. The first enforces strong sequential stability, requiring each policy to be stabilizing and successive policies to remain close. This sequential condition helps prevent state explosion at policy update times; however, it results in a suboptimal regret scaling with respect to the DARE solution. Motivated by this limitation, we introduce a second class of algorithms that removes this requirement and instead requires only that each generated policy be stabilizing. Closed-loop stability is then preserved through a dwell-time-inspired policy-update rule. This class of algorithms also addresses a key shortcoming of most existing approaches, which lack explicit high-probability bounds on the state trajectory expressed in system-theoretic terms. Our analysis shows that partially relaxing the sequential-stability requirement yields optimal regret. Finally, our method eliminates the need for any \emph{a priori} bound on the norm of the DARE solution, an assumption required by all existing computationally efficient OFU-based algorithms.
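Since the regret bound is stated in terms of the DARE solution, a small numerical illustration may help fix ideas. The sketch below is not the paper's algorithm: it assumes a scalar system $x_{t+1} = a x_t + b u_t + w_t$ with known parameters and stage cost $q x_t^2 + r u_t^2$, and solves the scalar DARE by plain fixed-point iteration. The function names and the numerical values of `a`, `b`, `q`, `r` are hypothetical, chosen only for illustration.

```python
import math


def solve_scalar_dare(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Iterate p <- q + a^2 p - (a b p)^2 / (r + b^2 p) until it stops changing.

    This is the scalar form of the DARE; the iteration starts from p = q.
    """
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def lqr_gain(a, b, r, p):
    """Optimal feedback gain k for the control law u = -k x, given DARE solution p."""
    return a * b * p / (r + b * b * p)


# Hypothetical example system (illustrative values, not taken from the paper).
a, b, q, r = 1.0, 1.0, 1.0, 1.0

p_star = solve_scalar_dare(a, b, q, r)   # DARE solution for the known system
k_star = lqr_gain(a, b, r, p_star)       # optimal gain
closed_loop = a - b * k_star             # closed-loop pole under u = -k_star * x

print(f"p_star = {p_star:.6f}, k_star = {k_star:.6f}, closed loop = {closed_loop:.6f}")
```

For these values the scalar DARE reduces to $p^2 - p - 1 = 0$, so the iteration should approach the golden ratio $(1+\sqrt{5})/2$, and the closed-loop pole lies strictly inside the unit interval. In the learning setting of the abstract, $a$ and $b$ are unknown and the policy is designed from confidence sets rather than from the true parameters.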
