LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits
Masahiro Kato, Shinji Ito
TL;DR
This work presents a practical BoBW algorithm for linear contextual bandits based on FTRL with Tsallis entropy. The proposed α-LC-Tsallis-INF combines a regression estimator with a context-aware exploration policy. It achieves O(log T) regret in the stochastic regime when the suboptimality gap is uniformly bounded from below, and O(√T) regret in the adversarial regime, with improved T-dependence compared to Shannon-entropy variants. Under the milder margin condition with parameter β, the regret is O(log(T)^{(1+β)/(2+β)} T^{1/(2+β)}). The analysis covers arm-dependent feature settings and includes a regret-transformation framework that translates results between arm-dependent and arm-independent formulations. It also addresses computation, using MGR or the exact Σ^{-1} when feasible. The result is an implementable alternative to black-box BoBW methods for linear contextual bandits.
Abstract
We investigate the \emph{linear contextual bandit problem} with independent and identically distributed (i.i.d.) contexts. In this problem, we aim to develop a \emph{Best-of-Both-Worlds} (BoBW) algorithm with regret upper bounds in both stochastic and adversarial regimes. We develop an algorithm based on \emph{Follow-The-Regularized-Leader} (FTRL) with Tsallis entropy, referred to as the $\alpha$-\emph{Linear-Contextual (LC)-Tsallis-INF}. We show that its regret is at most $O(\log(T))$ in the stochastic regime under the assumption that the suboptimality gap is uniformly bounded from below, and at most $O(\sqrt{T})$ in the adversarial regime. Furthermore, our regret analysis is extended to more general regimes characterized by the \emph{margin condition} with a parameter $\beta \in (1, \infty]$, which imposes a milder assumption on the suboptimality gap. We show that the proposed algorithm achieves $O\left(\log(T)^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$ regret under the margin condition. As $\beta \to \infty$, the exponents tend to $1$ and $0$, so this bound recovers the $O(\log(T))$ rate.
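To make the phrase "FTRL with Tsallis entropy" concrete, the following is a minimal sketch of a single FTRL step over the probability simplex with the 1/2-Tsallis regularizer, written for a fixed set of cumulative loss estimates. It is an illustration under stated assumptions, not the authors' $\alpha$-LC-Tsallis-INF: the function name `tsallis_half_probs`, the choice $\alpha = 1/2$, the fixed learning rate `eta`, and the bisection solver are all assumptions made here. In the paper's setting the loss estimates would come from a regression estimator evaluated at the observed context, and the resulting distribution would be combined with an exploration policy.

```python
import numpy as np


def tsallis_half_probs(cum_loss_est, eta, tol=1e-12, max_iter=200):
    """One FTRL step on the simplex with the 1/2-Tsallis regularizer.

    Solves  argmin_p  <p, L> - (4 / eta) * sum_i sqrt(p_i)  over the simplex.
    The first-order conditions give p_i = 4 / (eta * (L_i - x))^2, where the
    scalar x < min(L) is chosen so that the p_i sum to one.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    L = np.asarray(cum_loss_est, dtype=float)
    K = L.size

    # Bracket the normalizer x: at `lo` every p_i <= 1/K, so sum(p) <= 1;
    # at `hi` the best arm alone has p_i = 1, so sum(p) >= 1.
    lo = L.min() - 2.0 * np.sqrt(K) / eta
    hi = L.min() - 2.0 / eta

    # sum(p) is increasing in x on this interval, so bisection applies.
    for _ in range(max_iter):
        x = 0.5 * (lo + hi)
        p = 4.0 / (eta * (L - x)) ** 2
        if p.sum() > 1.0:
            hi = x
        else:
            lo = x
        if hi - lo < tol:
            break

    x = 0.5 * (lo + hi)
    p = 4.0 / (eta * (L - x)) ** 2
    return p / p.sum()  # remove the residual numerical error


if __name__ == "__main__":
    # Arms with smaller estimated cumulative loss receive more probability.
    print(tsallis_half_probs([1.0, 3.0, 2.0], eta=0.5))
```

Unlike the exponential weights produced by a Shannon-entropy regularizer, the probabilities here decay polynomially in the loss gaps, so every arm keeps nonzero mass at every step.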
