An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

Amaury Gouverneur; Borja Rodríguez-Gálvez; Tobias J. Oechtering; Mikael Skoglund

TL;DR

The paper studies logistic bandits, where an agent receives binary rewards whose success probability is a logistic function with slope parameter β. It targets a weakness of earlier analyses: regret bounds that either scaled worse than logarithmically with β or depended on the number of actions. Using the information-ratio framework for Thompson Sampling (TS) together with a quantized-parameter analysis, the authors prove the bound Γ_t ≤ (9/2) d α^{-2} on the information ratio, where α measures the alignment between the action space and the parameter space. This bound is independent of β. It yields a Bayesian regret bound of order O((d/α) · √(T log(βT/d))) and, when the action space contains the parameter space, a regret of order Õ(d√T). To the authors' knowledge, these are the first regret bounds for logistic bandits that depend only logarithmically on β and are independent of the number of actions. The results hinge on bounding the information gained about the optimal action via mutual-information decompositions, surrogate variance controls, and a careful asymptotic analysis as β → ∞. Overall, the work advances both the theoretical understanding and the practical applicability of TS in nonlinear bandit settings, and it suggests extensions to generalized linear models and to frequentist guarantees.
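
To see how a bound on the information ratio becomes a regret bound, the following is a sketch of the generic argument from Russo and Van Roy (2016), not the paper's exact derivation. It assumes a finite action set so that the entropy $H(A^\star)$ of the optimal action is finite. The symbols $\Delta_t$ (expected instantaneous regret at round $t$ given the history), $I_t$ (information gained about the optimal action $A^\star$ at round $t$), and $\Gamma$ (a uniform bound on $\Gamma_t$) are notation introduced here for illustration.

$$
\Gamma_t = \frac{\Delta_t^2}{I_t} \le \Gamma
\quad\Longrightarrow\quad
\mathbb{E}[\mathrm{Regret}(T)]
= \mathbb{E}\Big[\sum_{t=1}^{T} \Delta_t\Big]
\le \mathbb{E}\Big[\sum_{t=1}^{T} \sqrt{\Gamma\, I_t}\Big]
\le \sqrt{\Gamma\, T\, \mathbb{E}\Big[\sum_{t=1}^{T} I_t\Big]}
\le \sqrt{\Gamma\, T\, H(A^\star)} .
$$

The second inequality uses Cauchy-Schwarz and Jensen, and the last uses the chain rule for mutual information. Substituting $\Gamma = \tfrac{9}{2} d \alpha^{-2}$ gives $\tfrac{1}{\alpha}\sqrt{\tfrac{9}{2}\, d\, T\, H(A^\star)}$. If the information term is of order $d \log(\beta T/d)$, which is presumably what the quantized-parameter analysis provides in place of $H(A^\star)$, the result is of order $(d/\alpha)\sqrt{T \log(\beta T/d)}$, consistent with the stated rate.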

Abstract

We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(β\langle a, θ\rangle)/(1+\exp(β\langle a, θ\rangle))$, with slope parameter $β>0$, and where both the action $a\in \mathcal{A}$ and parameter $θ\in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}dα^{-2}$, where $α$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $β$. Using this result, we derive a bound of order $O(d/α\sqrt{T \log(βT/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $β$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
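
As an illustration of the setting described above, here is a minimal sketch of Thompson Sampling on a logistic bandit. It is not the authors' implementation. To keep the posterior exact, it assumes the parameter is drawn from a finite set of candidates on the unit circle ($d = 2$) with a uniform prior, whereas the paper works with general action and parameter spaces inside the unit ball. The function names `reward_prob`, `unit_circle_points`, and `thompson_sampling` are hypothetical.

```python
import math
import random


def reward_prob(action, theta, beta):
    """Logistic reward probability exp(b<a,t>) / (1 + exp(b<a,t>))."""
    z = beta * sum(a * t for a, t in zip(action, theta))
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def unit_circle_points(n):
    """n evenly spaced points on the unit circle (assumed finite spaces)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def thompson_sampling(actions, thetas, true_theta, beta, horizon, rng):
    """Run Thompson Sampling with a uniform prior over the finite set
    `thetas` and return the cumulative expected regret."""
    log_post = [0.0] * len(thetas)  # unnormalised log-posterior
    best = max(reward_prob(a, true_theta, beta) for a in actions)
    regret = 0.0
    for _ in range(horizon):
        # 1. Sample a parameter from the current posterior.
        m = max(log_post)
        weights = [math.exp(lp - m) for lp in log_post]
        sampled = rng.choices(thetas, weights=weights, k=1)[0]
        # 2. Play the action that is optimal for the sampled parameter.
        action = max(actions, key=lambda a: reward_prob(a, sampled, beta))
        # 3. Observe a binary reward from the true model.
        p = reward_prob(action, true_theta, beta)
        reward = 1 if rng.random() < p else 0
        regret += best - p
        # 4. Bayes update of every candidate parameter.
        for i, th in enumerate(thetas):
            q = reward_prob(action, th, beta)
            q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            log_post[i] += math.log(q if reward else 1.0 - q)
    return regret


if __name__ == "__main__":
    rng = random.Random(0)
    points = unit_circle_points(8)
    total = thompson_sampling(points, points, points[3], beta=4.0,
                              horizon=200, rng=rng)
    print("cumulative expected regret:", total)
```

Here the action set equals the candidate parameter set, loosely mirroring the case in which the action space contains the parameter space. Increasing `beta` sharpens the logistic curve, which is the regime where a logarithmic rather than a worse dependence on $β$ matters.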
