Fast UCB-type algorithms for stochastic bandits with heavy and super heavy symmetric noise
Yuriy Dorn, Aleksandr Katrutsa, Ilgam Latypov, Andrey Pudovikov
TL;DR
The paper addresses stochastic multi-armed bandits with heavy-tailed rewards by recasting reward estimation as a per-arm convex optimization problem solved with an inexact oracle. It introduces the $g(k,\delta)$-bounded framework and the FO-UCB and ZO-UCB algorithms, whose regret bounds are expressed in terms of $g^{-1}$, and it presents Clipped-SGD-UCB as a practical instantiation. Under symmetric noise, Clipped-SGD-UCB achieves an $O(\log T\sqrt{KT\log T})$ regret bound, even when the reward distribution has no expectation. The theoretical results connect the convergence rate of the optimization method to bandit regret, and experiments in super-heavy-tailed, heavy-tailed, and Gaussian settings show competitive regret and lower running time than robust UCB baselines. Overall, the work offers an optimization-driven route to robust UCB algorithms for heavy-tailed bandits with practical computational benefits.
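The idea of estimating an arm's reward by running a clipped stochastic gradient method can be illustrated with a minimal sketch. The code below is not the authors' Clipped-SGD-UCB; it only shows clipped SGD on the per-sample loss $\frac{1}{2}(x - r)^2$ for a single arm whose rewards carry symmetric Cauchy noise. The function names, the clipping level `lam`, the step schedule `step_scale / (t + 1)`, and the sample size are all assumptions chosen for illustration.

```python
import math
import random


def clip(g, lam):
    """Return g unchanged if |g| <= lam, otherwise shrink it to magnitude lam."""
    return g if abs(g) <= lam else lam * (1.0 if g > 0 else -1.0)


def clipped_sgd_estimate(rewards, lam=1.0, step_scale=2.0, x0=0.0):
    """Estimate the centre of symmetry of the rewards with clipped SGD.

    Hypothetical helper: lam and the step schedule are illustrative
    choices, not values taken from the paper.
    """
    x = x0
    for t, r in enumerate(rewards, start=1):
        grad = x - r  # stochastic gradient of 0.5 * (x - r) ** 2
        x -= (step_scale / (t + 1)) * clip(grad, lam)
    return x


def cauchy_rewards(location, n, seed=0):
    """Draw n rewards equal to location plus standard Cauchy noise."""
    rng = random.Random(seed)
    return [location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


rewards = cauchy_rewards(location=1.0, n=20000, seed=0)
estimate = clipped_sgd_estimate(rewards)
sample_mean = sum(rewards) / len(rewards)
print(f"clipped SGD estimate: {estimate:.3f}")
print(f"plain sample mean:    {sample_mean:.3f}")
```

Because the noise is symmetric, the expected clipped gradient has the same sign as $x$ minus the true location, so the iterate drifts toward the location even though the Cauchy distribution has no mean. The plain sample mean is printed only as a contrast and need not be close to the true value.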
Abstract
In this study, we propose a new method for constructing UCB-type algorithms for stochastic multi-armed bandits based on general convex optimization methods with an inexact oracle. We derive regret bounds corresponding to the convergence rates of the optimization methods. We propose a new algorithm, Clipped-SGD-UCB, and show, both theoretically and empirically, that in the case of symmetric noise in the reward we can achieve an $O(\log T\sqrt{KT\log T})$ regret bound instead of the $O\left(T^{\frac{1}{1+α}} K^{\frac{α}{1+α}}\right)$ bound that applies when the reward distribution satisfies $\mathbb{E}_{X \in D}[|X|^{1+α}] \leq σ^{1+α}$ with $α \in (0, 1]$, i.e., we perform better than the general lower bound for heavy-tailed bandits would suggest. Moreover, the same bound holds even when the reward distribution has no expectation, that is, when $α<0$.
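To make the comparison of the two rates concrete, here is a back-of-the-envelope calculation that ignores constants and treats both $O(\cdot)$ expressions as exact. The value $α = 1/2$ is only an illustrative assumption.

$$
\frac{T^{\frac{1}{1+α}} K^{\frac{α}{1+α}}}{\log T \sqrt{KT\log T}}
= \frac{(T/K)^{\frac{1-α}{2(1+α)}}}{(\log T)^{3/2}},
\qquad
α = \tfrac{1}{2}: \quad \frac{(T/K)^{1/6}}{(\log T)^{3/2}}.
$$

For any fixed $α \in (0, 1)$ the exponent $\frac{1-α}{2(1+α)}$ is positive, so the ratio grows polynomially in $T$ and the symmetric-noise bound is eventually the smaller of the two.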
