Sparsity-Agnostic Linear Bandits with Adaptive Adversaries
Tianyuan Jin, Kyoungseok Jang, Nicolò Cesa-Bianchi
TL;DR
This work addresses sparsity-agnostic stochastic linear bandits under adaptive adversaries by introducing SparseLinUCB, a multi-level confidence-set algorithm that achieves $\tilde{O}(S\sqrt{dT})$ regret without prior knowledge of the sparsity level $S$ and without strong assumptions on the action sets. It combines online-to-confidence-set conversions over a hierarchy of radii with a sparse online base learner (SeqSEW), and it also yields an instance-dependent bound that ties the regret to the suboptimality gap $\Delta$. The paper further proposes AdaLinUCB, which uses Exp3 to adaptively weight the confidence-set radii, achieves $\tilde{O}(\max\{\sqrt{dq},\sqrt{S/q}\}\sqrt{dT})$ regret, and improves empirically on OFUL. Together, the theoretical and empirical results show that sparsity-aware guarantees are attainable with adversarially generated action sets; the paper also points to tightening the bounds and relaxing the noise assumptions as open directions.
Abstract
We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds, which depend on the number $S$ of non-zero coefficients in the linear reward function. Previous works focused on the case where $S$ is known, or where the action sets satisfy additional assumptions. In this work, we obtain the first sparse regret bounds that hold when $S$ is unknown and the action sets are adversarially generated. Our techniques combine online-to-confidence-set conversions with a novel randomized model selection approach over a hierarchy of nested confidence sets. When $S$ is known, our analysis recovers state-of-the-art bounds for adversarial action sets. We also show that a variant of our approach, which uses Exp3 to dynamically select the confidence sets, can improve the empirical performance of stochastic linear bandits while enjoying a regret bound with optimal dependence on the time horizon.
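To make the idea of selecting among nested confidence sets with Exp3 concrete, the following is a minimal sketch and not the paper's AdaLinUCB. It swaps the SeqSEW-based online-to-confidence-set conversion for a plain ridge-regression estimate, and the function names, the geometric grid of radii, the reward rescaling into $[0,1]$, the Exp3 rate `gamma`, and the random (rather than adversarial) action sets are all assumptions made for illustration.

```python
import numpy as np


def exp3_probs(weights, gamma):
    """Exp3 sampling distribution: normalized weights mixed with uniform exploration."""
    k = len(weights)
    return (1.0 - gamma) * weights / weights.sum() + gamma / k


def run_radius_selection(T=500, d=10, S=2, n_actions=20, n_levels=5,
                         gamma=0.1, seed=0):
    """Simulate a LinUCB-style learner whose confidence radius is picked by Exp3.

    Returns the cumulative pseudo-regret and the final Exp3 distribution
    over the radius levels.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical S-sparse, unit-norm parameter (unknown to the learner).
    theta = np.zeros(d)
    theta[:S] = 1.0 / np.sqrt(S)

    # Hypothetical hierarchy of radii: larger radius = larger confidence set.
    radii = 2.0 ** np.arange(n_levels)
    weights = np.ones(n_levels)

    V = np.eye(d)        # regularized Gram matrix
    b = np.zeros(d)      # sum of reward-weighted actions
    total_regret = 0.0

    for _ in range(T):
        # Fresh action set each round (random here, for simplicity).
        actions = rng.normal(size=(n_actions, d))
        actions /= np.linalg.norm(actions, axis=1, keepdims=True)

        # Exp3 samples which confidence radius to use this round.
        probs = exp3_probs(weights, gamma)
        i = rng.choice(n_levels, p=probs)

        # Optimistic action for the sampled radius.
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        widths = np.sqrt(np.einsum("ad,de,ae->a", actions, V_inv, actions))
        a = actions[np.argmax(actions @ theta_hat + radii[i] * widths)]

        mean = a @ theta
        reward = mean + 0.1 * rng.normal()
        total_regret += (actions @ theta).max() - mean

        # Ridge-regression statistics update.
        V += np.outer(a, a)
        b += reward * a

        # Exp3 update with an importance-weighted reward rescaled into [0, 1].
        r01 = np.clip((reward + 1.0) / 2.0, 0.0, 1.0)
        weights[i] *= np.exp(gamma * (r01 / probs[i]) / n_levels)
        weights /= weights.max()   # rescale for numerical stability

    return total_regret, exp3_probs(weights, gamma)


if __name__ == "__main__":
    regret, probs = run_radius_selection()
    print("cumulative pseudo-regret:", regret)
    print("final distribution over radii:", probs)
```

In this sketch the learner never uses $S$: Exp3 shifts probability toward whichever radius level has produced the highest rewards, which is the intuition behind adaptively weighting the confidence sets.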
