Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning
Ahmadreza Moradipari, Mohammad Pedramfar, Modjtaba Shokrian Zini, Vaneet Aggarwal
TL;DR
This work develops Bayesian regret guarantees for Thompson Sampling in time-inhomogeneous reinforcement learning, covering both Bayesian transitions and Bayesian rewards. It introduces surrogate environments via an $\varepsilon$-value partition and analyzes the information ratio through a posterior-consistency framework, yielding a regret bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$. The framework provides concrete bounds for tabular, linear, and finite-mixture RL, and resolves gaps in prior nonlinear analyses by delivering a correct surrogate construction and dimension bounds. Key concepts include the value diameter $\lambda$, which reflects the variation of the value function across states, and the surrogate dimension $d_{l_1}$, which ties regret to the complexity of the environment space. The results rely on posterior consistency and surrogate learning to achieve dimension-aware regret guarantees without restrictive priors, and the authors conjecture near-optimality up to a $\sqrt{H}$ factor in the time-inhomogeneous setting.
Abstract
In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear, and finite mixtures, and discuss how our results are either the first of their kind or improve the state of the art.
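To make the algorithm under analysis concrete, the following is a minimal sketch of Thompson Sampling (posterior sampling) in a tabular, time-inhomogeneous MDP. It is an illustration under stated assumptions, not the construction used in the paper: it assumes a Dirichlet(1, ..., 1) prior on each transition distribution, known mean rewards (the paper also treats Bayesian rewards), and a fixed initial state 0. The function names `plan`, `evaluate`, and `thompson_sampling` are hypothetical.

```python
import numpy as np

def plan(P, R):
    """Backward induction. P: (H, S, A, S) transitions, R: (H, S, A) mean rewards.
    Returns the greedy policy (H, S) and optimal values V of shape (H + 1, S)."""
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = R[h] + P[h] @ V[h + 1]          # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V

def evaluate(P, R, policy):
    """Value at step 0 of a deterministic policy (H, S) in the environment (P, R)."""
    H, S, A, _ = P.shape
    idx = np.arange(S)
    V = np.zeros(S)
    for h in range(H - 1, -1, -1):
        a = policy[h]
        V = R[h, idx, a] + P[h, idx, a] @ V
    return V

def thompson_sampling(P_true, R, episodes, seed=0):
    """Run posterior sampling and return the cumulative regret from state 0.
    Assumption: Dirichlet(1, ..., 1) prior on every P[h, s, a, :]; rewards known."""
    rng = np.random.default_rng(seed)
    H, S, A, _ = P_true.shape
    counts = np.ones((H, S, A, S))          # Dirichlet posterior parameters
    _, V_star = plan(P_true, R)
    regret = 0.0
    for _ in range(episodes):
        # Sample an environment from the posterior: normalized Gamma draws
        # are Dirichlet distributed.
        g = rng.gamma(counts)
        P_sample = g / g.sum(axis=-1, keepdims=True)
        # Act optimally for the sampled environment.
        policy, _ = plan(P_sample, R)
        regret += V_star[0, 0] - evaluate(P_true, R, policy)[0]
        # Roll out one episode in the true environment and update the posterior.
        s = 0
        for h in range(H):
            a = policy[h, s]
            s_next = rng.choice(S, p=P_true[h, s, a])
            counts[h, s, a, s_next] += 1
            s = s_next
    return regret
```

Because transitions are indexed by the step $h$, the posterior keeps a separate count table for each step, which is what distinguishes the time-inhomogeneous setting from the homogeneous one in this sketch.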
