An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
TL;DR
The paper studies Thompson Sampling (TS) for logistic bandits, where rewards are binary and follow a logistic model with slope parameter β, and addresses the problem that earlier regret bounds scale poorly with β. Using the information-ratio framework and a quantized-parameter analysis, the authors prove that the information ratio satisfies Γ_t ≤ (9/2) d α^{-2}, a bound that does not depend on β; here d is the dimension and α measures the alignment between the action space and the parameter space. This yields a Bayesian regret bound of order O((d/α) √(T log(βT/d))) and, when the action space contains the parameter space, Õ(d√T) regret. According to the authors, these are the first regret bounds for logistic bandits that depend only logarithmically on β and not on the number of actions. The analysis bounds the information gained about the optimal action through mutual-information decompositions, surrogate variance controls, and an asymptotic analysis as β → ∞. The work advances both the theoretical understanding and the practical applicability of TS in nonlinear bandit settings, and it suggests extensions to generalized linear models and frequentist guarantees.
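To make the step from the information-ratio bound to the regret rate concrete, the following is a rough sketch using the standard definitions of Russo and Van Roy (2016); the paper's own definitions may differ in details. The symbols $R$, $A^\star$, $Y_t$, $\mathcal{I}_T$, and the constant $c$ are notation introduced here for illustration, and the order of $\mathcal{I}_T$ is an assumption chosen to match the stated rate, not a result quoted from the paper.

```latex
\begin{aligned}
% Information ratio at round t: squared expected regret over information gained.
\Gamma_t &= \frac{\big(\mathbb{E}_t[R(A^\star,\Theta) - R(A_t,\Theta)]\big)^2}
                 {\mathrm{I}_t\big(A^\star; (A_t, Y_t)\big)}, \\
% If \Gamma_t \le \bar{\Gamma} for all t, Cauchy--Schwarz over the T rounds gives
\mathbb{E}[\mathrm{Regret}(T)] &\le \sqrt{\bar{\Gamma}\, T\, \mathcal{I}_T},
\qquad \mathcal{I}_T := \sum_{t=1}^{T} \mathbb{E}\big[\mathrm{I}_t\big(A^\star; (A_t, Y_t)\big)\big], \\
% Plugging in \bar{\Gamma} = (9/2) d \alpha^{-2} and ASSUMING \mathcal{I}_T \le c\, d \log(\beta T/d):
\mathbb{E}[\mathrm{Regret}(T)] &\le \sqrt{\tfrac{9}{2}\, d\, \alpha^{-2} \cdot T \cdot c\, d \log(\beta T/d)}
 = 3\sqrt{\tfrac{c}{2}}\; \frac{d}{\alpha} \sqrt{T \log(\beta T/d)}.
\end{aligned}
```

Under these assumptions, $\beta$ enters only through the logarithm in the information term, which is consistent with the $O\big((d/\alpha)\sqrt{T\log(\beta T/d)}\big)$ order stated above.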
Abstract
We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(β\langle a, θ\rangle)/(1+\exp(β\langle a, θ\rangle))$, with slope parameter $β>0$, where both the action $a\in \mathcal{A}$ and the parameter $θ\in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}dα^{-2}$, where $α$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$; this bound is independent of $β$. Using this result, we derive a bound of order $O\big((d/α)\sqrt{T \log(βT/d)}\big)$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $β$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
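The setting described above can be illustrated with a minimal Python sketch of Thompson Sampling on a logistic bandit. This is not the authors' implementation: the uniform prior over a finite set of candidate parameters, the finite action set, and the names `logistic_reward_prob` and `thompson_sampling` are all assumptions made so that the posterior can be updated exactly.

```python
import numpy as np


def logistic_reward_prob(action, theta, beta):
    """P(reward = 1) = exp(beta*<a, theta>) / (1 + exp(beta*<a, theta>))."""
    return 1.0 / (1.0 + np.exp(-beta * np.dot(action, theta)))


def thompson_sampling(actions, thetas, true_idx, beta, horizon, rng):
    """Thompson Sampling with a uniform prior over the rows of `thetas`.

    The finite candidate set is an illustrative assumption that makes the
    Bayesian update exact. Returns (cumulative expected regret, posterior).
    """
    # probs[i, j] = P(reward = 1 | action i, candidate parameter j)
    probs = 1.0 / (1.0 + np.exp(-beta * actions @ thetas.T))
    probs = np.clip(probs, 1e-12, 1.0 - 1e-12)  # keep the logs finite
    true_probs = probs[:, true_idx]
    best = true_probs.max()

    log_post = np.zeros(len(thetas))  # unnormalized log-posterior (uniform prior)
    regret = 0.0
    for _ in range(horizon):
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        j = rng.choice(len(thetas), p=post)  # sample a parameter from the posterior
        i = int(np.argmax(probs[:, j]))      # play the best action for that sample
        y = rng.random() < true_probs[i]     # observe a binary reward
        # Bayes update with the Bernoulli likelihood of the observed reward
        log_post += np.log(probs[i]) if y else np.log(1.0 - probs[i])
        regret += best - true_probs[i]

    post = np.exp(log_post - log_post.max())
    return regret, post / post.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, beta, horizon = 3, 20, 5.0, 500
    # Candidate parameters on the unit sphere; the actions reuse the same set,
    # mimicking the case where the action space contains the parameter space.
    thetas = rng.normal(size=(n, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    true_idx = int(rng.integers(n))  # true parameter drawn from the prior
    regret, post = thompson_sampling(thetas, thetas, true_idx, beta, horizon, rng)
    print(f"cumulative expected regret after {horizon} steps: {regret:.2f}")
    print(f"posterior mass on the true parameter: {post[true_idx]:.3f}")
```

Drawing the true parameter from the same prior that the agent uses matches the Bayesian regret setting of the abstract. A continuous prior would require approximate posterior sampling, which this sketch does not attempt.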
