Fast UCB-type algorithms for stochastic bandits with heavy and super heavy symmetric noise

Yuriy Dorn; Aleksandr Katrutsa; Ilgam Latypov; Andrey Pudovikov

TL;DR

The paper addresses stochastic multi-armed bandits with heavy-tailed rewards by recasting reward estimation as per-arm convex optimization problems solved with inexact oracles. It introduces the $g(k,\delta)$-bounded framework and the FO-UCB and ZO-UCB algorithms, whose regret bounds are expressed in terms of $g^{-1}$, and it presents Clipped-SGD-UCB as a practical instantiation that achieves an $O(\log T\sqrt{KT\log T})$ regret bound under symmetric noise, even when the reward distribution has no expectation. The theoretical results connect the convergence rate of the optimization method to the bandit regret, and experiments across super-heavy-tail, heavy-tail, and Gaussian settings demonstrate competitive performance and real-time advantages over robust UCB baselines. Overall, the work provides a scalable, optimization-driven route to robust UCB algorithms for heavy-tailed bandits, with practical computational benefits and applicability in risk-sensitive environments.
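To make the "reward estimation as optimization" idea concrete, here is a minimal Python sketch of estimating a single arm's location with clipped SGD on the auxiliary function $\frac{1}{2}(x-\mu)^2$. The step size $1/k$, the clipping level `lam`, and the helper names are illustrative assumptions, not the schedule used in the paper.

```python
import math
import random


def clip(g, lam):
    """Clip a scalar gradient to the interval [-lam, lam]."""
    return max(-lam, min(lam, g))


def clipped_sgd_estimate(rewards, lam=1.0, x0=0.0):
    """Estimate the location of a reward stream with clipped SGD.

    Each reward r gives the stochastic gradient (x - r) of 0.5 * (x - mu)^2.
    The step size 1/k and the clipping level lam are assumed for illustration.
    """
    x = x0
    for k, r in enumerate(rewards, start=1):
        x -= clip(x - r, lam) / k
    return x


def cauchy_sample(rng, loc, scale=1.0):
    """Draw one Cauchy sample by inverse-CDF sampling."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)
mu = 0.7  # hypothetical true location of the arm
rewards = [cauchy_sample(rng, mu) for _ in range(20000)]
estimate = clipped_sgd_estimate(rewards)
sample_mean = sum(rewards) / len(rewards)
print(f"clipped SGD: {estimate:.3f}, sample mean: {sample_mean:.3f}")
```

Because the Cauchy noise is symmetric, the expected clipped gradient vanishes at the true location, so the iterate settles near `mu`. The sample mean of Cauchy draws, by contrast, does not concentrate as the number of samples grows.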

Abstract

In this study, we propose a new method for constructing UCB-type algorithms for stochastic multi-armed bandits based on general convex optimization methods with an inexact oracle. We derive the regret bounds corresponding to the convergence rates of the optimization methods. We propose a new algorithm, Clipped-SGD-UCB, and show, both theoretically and empirically, that in the case of symmetric noise in the reward, we can achieve an $O(\log T\sqrt{KT\log T})$ regret bound instead of $O\left(T^{\frac{1}{1+\alpha}} K^{\frac{\alpha}{1+\alpha}}\right)$ for the case when the reward distribution satisfies $\mathbb{E}_{X \in D}[|X|^{1+\alpha}] \leq \sigma^{1+\alpha}$ ($\alpha \in (0, 1]$), i.e., the algorithm performs better than the general lower bound for bandits with heavy tails would suggest. Moreover, the same bound holds even when the reward distribution does not have an expectation, that is, when $\alpha<0$.
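As a rough numerical illustration of the two rates quoted in the abstract, the following Python sketch evaluates both expressions with all constants dropped. The function names and the chosen values of $T$, $K$, and $\alpha$ are assumptions for illustration, and only the growth in $T$ is meaningful.

```python
import math


def heavy_tail_rate(T, K, alpha):
    """T^(1/(1+alpha)) * K^(alpha/(1+alpha)), constants dropped."""
    return T ** (1.0 / (1.0 + alpha)) * K ** (alpha / (1.0 + alpha))


def symmetric_noise_rate(T, K):
    """log(T) * sqrt(K * T * log(T)), constants dropped."""
    return math.log(T) * math.sqrt(K * T * math.log(T))


K = 10
for T in (10**4, 10**6, 10**8):
    row = [f"{heavy_tail_rate(T, K, a):.3g}" for a in (0.25, 0.5, 1.0)]
    print(T, row, f"{symmetric_noise_rate(T, K):.3g}")
```

At $\alpha = 1$ the heavy-tail expression reduces to $\sqrt{KT}$, which is smaller than the symmetric-noise expression by logarithmic factors. For $\alpha < 1$ the exponent $\frac{1}{1+\alpha}$ exceeds $\frac{1}{2}$, so the symmetric-noise expression is eventually smaller as $T$ grows, although the crossover can require a very large $T$ when $\alpha$ is close to 1.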

Paper Structure (25 sections, 10 theorems, 43 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Related works
Contributions
UCB via stochastic optimization algorithms
Optimization methods with inexact oracle
FO-UCB and ZO-UCB algorithms
Convergence of FO-UCB
Convergence of ZO-UCB
Clipped-SGD-UCB
Clipped-SGD
Clipped-SGD-UCB
Numerical Experiments
Initialization of reward estimates.
Super-heavy tail MAB
Convergence comparison.
...and 10 more sections

Key Result

Theorem 2

The regret of FO-UCB with a $g(k, \delta)$-bounded first-order algorithm for the MAB problem with $K$ arms, auxiliary functions $f_i(x) = \frac{1}{2}(x-\mu_i)^2$, period $T$, and $\delta=\frac{1}{T^2}$ satisfies a bound expressed in terms of $g^{-1}$.
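A short worked derivation may help show why the auxiliary function $f_i$ turns mean estimation into an optimization problem. The reward model $r = \mu_i + \xi$, the step size $\gamma_k$, and the iterate notation $x_k$ are assumptions introduced here for illustration.

$$
\begin{aligned}
\nabla f_i(x) &= x - \mu_i, \qquad \arg\min_x f_i(x) = \mu_i,\\
g(x; r) &= x - r = \nabla f_i(x) - \xi \quad \text{(stochastic gradient from one observed reward } r = \mu_i + \xi\text{)},\\
x_k &= x_{k-1} - \gamma_k\,(x_{k-1} - r_k) = (1-\gamma_k)\,x_{k-1} + \gamma_k\, r_k,\\
\gamma_k = \tfrac{1}{k} \;&\Longrightarrow\; x_k = \frac{1}{k}\sum_{j=1}^{k} r_j .
\end{aligned}
$$

Under these assumptions, plain SGD with step size $1/k$ reproduces the sample mean used by classical UCB, so replacing the SGD step with a more robust first-order method changes only the estimator, while the optimization problem stays the same.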

Figures (7)

Figure 1: Convergence of the regret (first row) and the mean regret (second row) for the considered algorithms with reward noise drawn from the Cauchy distribution ($\gamma=1$). We report values averaged over 120 trials and show the corresponding standard deviation as shaded regions. Our algorithms converge faster than the competitors in Env1 and Env2 and slightly slower than RUCB-Median in Env3.
Figure 2: Convergence of the regret (first row) and the mean regret (second row) for the considered algorithms with reward noise drawn from the Fréchet distribution ($\alpha=1.25$). We report values averaged over 120 trials and show the corresponding standard deviation as shaded regions. Our algorithms converge faster in Env2, asymptotically faster in Env3, and slower than RUCB-Median in Env1.
Figure 3: Convergence of the considered algorithms in test environments: a) 10 arms and $\mu_i = i / 10$, where $i=0, \ldots, 9$ (first column); b) 10 arms and $\mu_i = i / 50$, where $i=0, \ldots, 9$ (second column); c) 100 arms and $\mu_i = i / 50$, where $i=0, \ldots, 99$ (third column). In these test environments, UCB provides the best regret and mean regret among the alternatives, while our algorithms converge to the same limiting values of the mean regret.
Figure 4: (a) Comparison of the mean regret for the considered algorithms in a Gaussian MAB with two arms with rewards $\{0,\Delta\}$. Our algorithms distinguish arms with close rewards about as well as the competitors. (b) Comparison of the mean regret for the considered algorithms in a heavy-tailed MAB with five arms with rewards $\{0,0,0,0,\Delta\}$. Our algorithms distinguish arms with close rewards even when the noise is generated from the Cauchy distribution ($\gamma = 1$).
Figure 5: Regret and mean regret comparison for reward noise generated from the Fréchet distribution ($\alpha=1$).
...and 2 more figures

Theorems & Definitions (13)

Theorem 2: Convergence of FO-UCB
Theorem 3: Convergence of ZO-UCB
Remark 4
Definition 5
Theorem 8
Corollary 9
Theorem 10: Convergence of Clipped-SGD-UCB
Remark 11
Theorem 2: Convergence of FO-UCB
Theorem 3: Convergence of ZO-UCB
...and 3 more