Thompson Sampling-Based Learning and Control for Unknown Dynamic Systems
Kaikai Zheng, Dawei Shi, Yang Shi, Long Wang
TL;DR
This paper addresses learning-based control of unknown nonlinear systems by combining Thompson sampling (TS) with a reproducing kernel Hilbert space (RKHS) parameterization of controllers. It constructs a convex, RKHS-backed function space around an initial controller and uses TS to explore and exploit candidate control laws, with a Bayesian posterior update for the unknown cost function. The authors prove exponential convergence of the learned cost to a neighborhood of the true cost, derive a finite upper bound on control regret that separates a stationary approximation error from a decaying exploration term, and analyze closed-loop mean-square stability under the learning process. Extensions include adaptive segment lengths and handling of nonstationary rewards, and the computational complexity is characterized for practical kernel-based implementations. Numerical simulations on unknown nonlinear systems validate rapid convergence, meaningful regret bounds, and favorable performance compared with GP-MPC, highlighting the approach’s robustness and scalability for data-driven control without explicit system identification.
Abstract
Thompson sampling (TS) is a Bayesian randomized exploration strategy that samples options (e.g., system parameters or control laws) from the current posterior and then applies the selected option that is optimal for a task, thereby balancing exploration and exploitation; this makes TS effective for active learning-based controller design. However, TS relies on finite parametric representations, which limits its applicability to the more general spaces that are commonly encountered in control system design. To address this issue, this work proposes a parameterization method for control law learning using reproducing kernel Hilbert spaces and designs a data-driven active learning control approach. Specifically, the proposed method treats the control law as an element of a function space, allowing control laws to be designed without imposing restrictions on the system structure or the form of the controller. A TS framework is proposed to reduce control costs through online exploration and exploitation, and convergence guarantees are provided for the learning process. Theoretical analysis shows that the proposed method learns the relationship between control laws and closed-loop performance metrics at an exponential rate, and an upper bound on the control regret is also derived. Furthermore, the closed-loop stability of the proposed learning framework is analyzed. Numerical experiments on controlling unknown nonlinear systems validate the effectiveness of the proposed method.
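As a rough illustration of the sample-then-act loop described in the abstract, the Python sketch below runs classical Thompson sampling over a small finite set of candidate control laws, each written as an initial controller plus a Gaussian-kernel expansion. It is not the authors' algorithm, which works over a function space rather than a fixed list. The plant, the initial controller `u0(x) = -0.5 x`, the kernel centers and weights, the independent Gaussian cost model with an assumed noise variance, and all function names are hypothetical choices made only for this example.

```python
import math
import random


def gaussian_kernel(x, c, width=1.0):
    """Gaussian (RBF) kernel between a scalar state x and a center c."""
    return math.exp(-((x - c) ** 2) / (2.0 * width ** 2))


def make_controller(weights, centers):
    """u(x) = u0(x) + sum_i w_i k(x, c_i), with assumed u0(x) = -0.5 x."""
    def control_law(x):
        correction = sum(w * gaussian_kernel(x, c) for w, c in zip(weights, centers))
        return -0.5 * x + correction
    return control_law


def build_candidates():
    """Four hypothetical candidates: kernel weights around the initial controller."""
    centers = [-1.0, 1.0]
    weight_sets = [(0.0, 0.0), (0.3, -0.3), (0.6, -0.6), (-0.3, 0.3)]
    return [make_controller(w, centers) for w in weight_sets]


def rollout_cost(control_law, rng, steps=30):
    """Average quadratic cost on a toy plant that the learner never sees directly."""
    x, total = 1.0, 0.0
    for _ in range(steps):
        u = control_law(x)
        total += x * x + 0.1 * u * u
        # Hypothetical nonlinear dynamics with small process noise.
        x = 0.9 * x + 0.5 * math.sin(x) + u + rng.gauss(0.0, 0.05)
    return total / steps


def posterior(count, cost_sum, prior_mean, prior_var, noise_var):
    """Conjugate Gaussian posterior (mean, variance) for one candidate's mean cost."""
    precision = 1.0 / prior_var + count / noise_var
    mean = (prior_mean / prior_var + cost_sum / noise_var) / precision
    return mean, 1.0 / precision


def thompson_sampling(controllers, rounds=200, noise_var=0.05,
                      prior_mean=1.0, prior_var=4.0, seed=0):
    """Sample a cost per candidate, apply the argmin, observe, update."""
    rng = random.Random(seed)
    n = len(controllers)
    counts = [0] * n
    sums = [0.0] * n
    for _ in range(rounds):
        draws = []
        for i in range(n):
            mean, var = posterior(counts[i], sums[i], prior_mean, prior_var, noise_var)
            draws.append(rng.gauss(mean, math.sqrt(var)))
        chosen = min(range(n), key=draws.__getitem__)  # lowest sampled cost
        cost = rollout_cost(controllers[chosen], rng)
        counts[chosen] += 1
        sums[chosen] += cost
    post_means = [posterior(counts[i], sums[i], prior_mean, prior_var, noise_var)[0]
                  for i in range(n)]
    return counts, post_means


if __name__ == "__main__":
    counts, post_means = thompson_sampling(build_candidates())
    print("plays per candidate:", counts)
    print("posterior mean cost:", [round(m, 3) for m in post_means])
```

In this toy setup the fourth candidate weakens the feedback near the origin, so its observed cost is high and the sampler should rarely return to it after a few trials, while plays concentrate on the lower-cost candidates. Treating candidates as independent arms discards the correlation between similar control laws, which is the kind of structure a kernel-based cost model is meant to exploit.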
