Best-of-Both-Worlds Algorithms for Linear Contextual Bandits

Yuko Kuroki; Alberto Rumi; Taira Tsuchiya; Fabio Vitale; Nicolò Cesa-Bianchi

TL;DR

This work establishes best-of-both-worlds (BoBW) guarantees for $K$-armed linear contextual bandits, achieving near-optimal regret in both the adversarial and stochastic regimes without prior knowledge of the environment. It develops two approaches: (i) a data-dependent MWU-LC framework, built on a black-box reduction and loss predictors, that yields first- or second-order adversarial bounds together with polylogarithmic stochastic bounds, and (ii) an FTRL-LC method that does not need the inverse covariance matrix $\Sigma^{-1}$, estimates it with Matrix Geometric Resampling, and emphasizes computational efficiency. The results include the polylogarithmic stochastic regret $O\left( \frac{(dK)^2}{\Delta_{\min}} \mathrm{poly}\log(dKT) \right)$, the first-order adversarial bound $\tilde{O}(dK\sqrt{L^*})$ or the second-order adversarial bound $\tilde{O}(dK\sqrt{\Lambda^*})$, and an $\tilde{O}(dK\sqrt{T})$ adversarial bound for the FTRL approach with Shannon entropy regularizer. The methods extend to corrupted stochastic regimes and avoid prohibitive computations over the policy space.
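To make the role of Matrix Geometric Resampling more concrete, the following is a minimal sketch of the idea as it is commonly described: the inverse of $\Sigma_a = \mathbb{E}[\mathbb{1}\{A = a\} X X^\top]$ is approximated by a truncated Neumann series $\beta \sum_{k=0}^{M} (I - \beta \Sigma_a)^k$, where each factor is replaced by a fresh sample. The function name, its arguments, and the assumption that the learner can draw contexts from a simulator (`sample_context`) are illustrative choices, not details taken from this page. The step size `beta` is assumed small enough for the series to converge (for example $\beta \le 1$ when $\|X\| \le 1$).

```python
import numpy as np

def mgr_inverse_estimate(sample_context, policy, arm, dim, beta, num_iters, rng):
    """Sketch of a Matrix Geometric Resampling style estimate of Sigma_a^{-1}.

    sample_context(rng) -> context vector of shape (dim,)   (assumed simulator access)
    policy(context)     -> probabilities over the arms      (hypothetical interface)
    """
    identity = np.eye(dim)
    running_product = identity.copy()  # product of (I - beta * B_j) for j = 1..k
    series_sum = identity.copy()       # k = 0 term of the series
    for _ in range(num_iters):
        x = sample_context(rng)
        probs = policy(x)
        drawn_arm = rng.choice(len(probs), p=probs)
        # B_k = 1{drawn arm == arm} * x x^T is an unbiased sample of Sigma_a.
        if drawn_arm == arm:
            sample_matrix = np.outer(x, x)
        else:
            sample_matrix = np.zeros((dim, dim))
        running_product = running_product @ (identity - beta * sample_matrix)
        series_sum += running_product
    return beta * series_sum
```

Because the sampled matrices are independent, the expectation of the output equals the truncated series exactly, so the bias is controlled by the number of iterations `num_iters`.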

Abstract

We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and stochastic regimes, without prior knowledge about the environment. In the stochastic regime, we achieve the polylogarithmic rate $\frac{(dK)^2\mathrm{poly}\log(dKT)}{\Delta_{\min}}$, where $\Delta_{\min}$ is the minimum suboptimality gap over the $d$-dimensional context space. In the adversarial regime, we obtain either the first-order $\widetilde{O}(dK\sqrt{L^*})$ bound, or the second-order $\widetilde{O}(dK\sqrt{\Lambda^*})$ bound, where $L^*$ is the cumulative loss of the best action and $\Lambda^*$ is a notion of the cumulative second moment for the losses incurred by the algorithm. Moreover, we develop an algorithm based on FTRL with Shannon entropy regularizer that does not require the knowledge of the inverse of the covariance matrix, and achieves a polylogarithmic regret in the stochastic regime while obtaining $\widetilde{O}\big(dK\sqrt{T}\big)$ regret bounds in the adversarial regime.
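As a rough illustration of what FTRL with a Shannon entropy regularizer computes, the sketch below uses the standard fact that this regularizer over the probability simplex gives exponential weights on the cumulative estimated losses. It assumes a linear loss model $\ell_t(x, a) = \langle x, \theta_{t,a} \rangle$ and the usual importance-weighted estimate $\hat{\theta}_{t,a} = \Sigma_{t,a}^{-1} X_t \ell_t \mathbb{1}\{A_t = a\}$. The function names and the fixed learning rate `eta` are hypothetical simplifications; the paper's learning-rate schedule is not reproduced on this page.

```python
import numpy as np

def ftrl_shannon_policy(context, cumulative_theta, eta):
    """Arm probabilities for one context under entropy-regularized FTRL.

    cumulative_theta has shape (K, d): the running sum of per-arm loss-vector estimates.
    """
    scores = -eta * (cumulative_theta @ context)  # estimated cumulative loss per arm, negated
    scores -= scores.max()                        # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def loss_vector_estimate(context, arm, loss, sigma_inv_arm, num_arms):
    """Importance-weighted estimate: nonzero only in the row of the arm that was played."""
    theta_hat = np.zeros((num_arms, context.shape[0]))
    theta_hat[arm] = (sigma_inv_arm @ context) * loss
    return theta_hat
```

In a full loop, `cumulative_theta` would be incremented by `loss_vector_estimate(...)` after each round, with `sigma_inv_arm` either known or replaced by an estimate of the inverse covariance matrix.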

Paper Structure (30 sections, 34 theorems, 172 equations, 2 tables, 7 algorithms)

INTRODUCTION
Contributions.
Techniques.
PROBLEM STATEMENT
FOLLOW-THE-REGULARIZED-LEADER
DATA-DEPENDENT BOUNDS
UNKNOWN $\mathbf{\Sigma}^{-1}$ CASE
Conclusions
NOTATION
ADDITIONAL RELATED WORK
LOWER BOUND
USEFUL LEMMAS
Analysis of FTRL
Fundamental bounds for $K$-armed linear contextual bandits
APPENDIX FOR REDUCTION APPROACH
...and 15 more sections

Key Result

Proposition 1

Assume that $\bm{\overline{\Sigma}}_{t,a}$ and $\bm{\widetilde{\Sigma}}_{t,a}$ (as defined in the paper) are known to the learner at each round $t$ and for each action $a$. Given an adaptive sequence of weights $q_1, q_2, \ldots \in (0,1]$, suppose that MWU-LC observes the feedback in roun

Theorems & Definitions (58)

Proposition 1
Theorem 1
Corollary 1
Remark 1
Theorem 2
Lemma 1
Lemma 2
Lemma 3: Entropy-dependent regret bound for the auxiliary game
Proposition 2: Theorem 19 in Zierahn et al. (2023)
Lemma 4
...and 48 more
