LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits

Masahiro Kato, Shinji Ito

TL;DR

This work presents a practical Best-of-Both-Worlds (BoBW) algorithm for linear contextual bandits with i.i.d. contexts, built on FTRL with Tsallis entropy. The algorithm, α-LC-Tsallis-INF, combines a regression estimator with a context-aware exploration policy. It achieves O(log T) regret in the stochastic regime when the suboptimality gap is uniformly bounded from below, O(√T) regret in the adversarial regime, and O(log(T)^{(1+β)/(2+β)} T^{1/(2+β)}) regret under a margin condition with parameter β, with better T-dependence than Shannon-entropy variants. The analysis covers arm-dependent features and includes a regret-transformation framework that translates results between the arm-dependent and arm-independent formulations. It also addresses computation, either via Matrix Geometric Resampling (MGR) or via the exact Σ^{-1} when feasible. The result is an implementable, theoretically grounded alternative to black-box BoBW methods, with a path to extensions under milder margin conditions.

Abstract

We investigate the \emph{linear contextual bandit problem} with independent and identically distributed (i.i.d.) contexts. In this problem, we aim to develop a \emph{Best-of-Both-Worlds} (BoBW) algorithm with regret upper bounds in both stochastic and adversarial regimes. We develop an algorithm based on \emph{Follow-The-Regularized-Leader} (FTRL) with Tsallis entropy, referred to as the $\alpha$-\emph{Linear-Contextual (LC)-Tsallis-INF}. We show that its regret is at most $O(\log(T))$ in the stochastic regime under the assumption that the suboptimality gap is uniformly bounded from below, and at most $O(\sqrt{T})$ in the adversarial regime. Furthermore, our regret analysis is extended to more general regimes characterized by the \emph{margin condition} with a parameter $\beta \in (1, \infty]$, which imposes a milder assumption on the suboptimality gap. We show that the proposed algorithm achieves $O\left(\log(T)^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$ regret under the margin condition.
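The FTRL step with Tsallis entropy can be illustrated in the simpler non-contextual case. The sketch below is not the authors' algorithm: it assumes $\alpha = 1/2$, a fixed learning rate `eta`, and a vector `cum_loss` of cumulative estimated losses per arm (both names are hypothetical), and it omits the context-dependent loss estimator and exploration policy. Under these assumptions, minimizing $\langle p, L\rangle - \frac{4}{\eta}\sum_i \sqrt{p_i}$ over the probability simplex gives $p_i = 4 / (\eta (L_i - x))^2$, where the scalar $x < \min_i L_i$ is chosen so that the probabilities sum to one. The code finds $x$ by bisection.

```python
import numpy as np

def tsallis_half_policy(cum_loss, eta, iters=100):
    """Arm distribution from FTRL with 1/2-Tsallis entropy (illustrative sketch).

    cum_loss: cumulative estimated loss of each arm (hypothetical input).
    eta:      positive learning rate (assumed fixed here).
    """
    cum_loss = np.asarray(cum_loss, dtype=float)
    k = cum_loss.size
    # Bracket the normalizer x: total mass is <= 1 at `lo` and >= 1 at `hi`.
    lo = cum_loss.min() - 2.0 * np.sqrt(k) / eta
    hi = cum_loss.min() - 2.0 / eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mass = np.sum(4.0 / (eta * (cum_loss - mid)) ** 2)
        if mass > 1.0:
            hi = mid  # too much mass: move x further below the losses
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    p = 4.0 / (eta * (cum_loss - x)) ** 2
    return p / p.sum()  # remove residual bisection error

p = tsallis_half_policy([0.0, 1.0, 2.0], eta=1.0)
```

Arms with smaller cumulative loss receive more probability, but every arm keeps polynomially decaying (rather than exponentially decaying) mass, which is the usual motivation for preferring Tsallis entropy over Shannon entropy in this family of methods.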

Paper Structure

This paper contains 54 sections, 25 theorems, 134 equations, 1 table, and 2 algorithms.

Key Result

Theorem 4.1

Consider the $1/2$-LC-Tsallis-INF. Suppose that Assumptions asm:contextual_dist and asm:contextual_dist2--asm:bounded_random hold. Then the regret satisfies [bound not preserved in this excerpt], where $\omega_t \in [0, 1]$ is given as [definition not preserved in this excerpt].

Theorems & Definitions (48)

  • Definition 2.5: Stochastic regime with a margin condition
  • Definition 2.6: Adversarial regime with a self-bounding constraint
  • Theorem 4.1: General regret bounds
  • Theorem 4.2: Regret upper bound in an adversarial regime
  • Theorem 4.3: Regret upper bound in a stochastic regime with a margin condition
  • Theorem 4.4: Regret upper bound in an adversarial regime with a self-bounding constraint
  • Theorem 4.5
  • Proposition A.1
  • Proof of Proposition \ref{lem:basic}
  • Lemma A.2
  • ...and 38 more