Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems
Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
TL;DR
The paper addresses regret bounds for bandit problems with metric action spaces by extending information-theoretic analyses to a chaining framework. It introduces Two Steps Thompson Sampling and a chain-link information ratio to leverage reward continuity across nearby actions, yielding a regret bound that depends on the metric entropy of the action space. For d-dimensional linear bandits with smooth rewards, the authors obtain a tight $O(d\sqrt{T})$ regret rate and a unit-ball bound that matches the minimax rate $\Omega(d\sqrt{T})$, extending the applicability of information-theoretic methods to continuous action spaces. The results suggest that the Two Steps variant and chaining techniques can outperform prior $O(d\sqrt{T\log T})$ bounds and open avenues for generalized linear and logistic bandits in continuous settings.
Abstract
This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis of [Dong and Van Roy, 2020], who proved a bound with a regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson-Sampling. Under a suitable continuity assumption on the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.
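To make "metric entropy of the action space" concrete, the following recalls the standard volumetric estimate for the covering number of the Euclidean unit ball. It is a textbook fact rather than a statement taken from the paper, and the notation $\mathcal{N}$ (covering number), $B_2^d$ (unit ball in $\mathbb{R}^d$), and $\varepsilon$ (covering scale) is assumed here for illustration.

```latex
% Covering number of the Euclidean unit ball B_2^d at scale \varepsilon \in (0, 1]
% (standard volumetric bound; notation is an assumption, not the paper's).
\left(\frac{1}{\varepsilon}\right)^{d}
  \le \mathcal{N}\!\left(B_2^d, \|\cdot\|_2, \varepsilon\right)
  \le \left(1 + \frac{2}{\varepsilon}\right)^{d},
\qquad\text{hence}\qquad
d \log\frac{1}{\varepsilon}
  \le \log \mathcal{N}\!\left(B_2^d, \|\cdot\|_2, \varepsilon\right)
  \le d \log\frac{3}{\varepsilon}.
```

At any fixed scale the metric entropy therefore grows linearly in $d$, which is consistent with the dimension dependence of the $O(d\sqrt{T})$ rate stated above. The exact form of the chained bound is not reproduced here.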
