Stochastic Bandits Robust to Adversarial Attacks

Xuchuang Wang; Jinhang Zuo; Xutong Liu; John C. S. Lui; Mohammad Hajiesmaili

TL;DR

We study stochastic multi-armed bandits (MAB) under a strong adversary who observes the pulled arm and then perturbs its reward, subject to a total attack budget $C$. Two knowledge regimes are considered (known and unknown $C$), each with regret bounds whose dependence on $C$ is either additive or multiplicative. The paper designs robust algorithms (SE-WR and SE-WR-Stop for known $C$; PE-WR and MS-SE-WR for unknown $C$) and proves upper bounds that scale gracefully with $C$, e.g., $O\bigl( \sum_{k\neq k^*} \frac{\log T}{\Delta_k} + KC \bigr)$ for known $C$ and $\tilde{O}(\sqrt{KT}+KC^2)$ for unknown $C$, where $\tilde{O}$ hides logarithmic factors. It also establishes matching lower bounds and a fundamental separation between the attack and corruption models, showing that attacks can incur larger regret than corruptions under equivalent budgets. The results provide a nearly optimal toolkit for robust bandit learning under adversarial perturbations, with practical guidance on when additive or multiplicative $C$-dependence is preferable.

Abstract

This paper investigates stochastic multi-armed bandit algorithms that are robust to adversarial attacks, where an attacker can first observe the learner's action and then alter the learner's reward observation. We study two cases of this model, with or without knowledge of an attack budget $C$, defined as an upper bound on the sum of the differences between the actual and altered rewards. For both cases, we devise two types of algorithms, with regret bounds whose dependence on $C$ is additive or multiplicative. For the known attack budget case, we prove that our algorithms achieve regret bounds of $O((K/\Delta)\log T + KC)$ and $\tilde{O}(\sqrt{KTC})$ for the additive and multiplicative $C$ terms, respectively, where $K$ is the number of arms, $T$ is the time horizon, $\Delta$ is the gap between the expected rewards of the optimal arm and the second-best arm, and $\tilde{O}$ hides logarithmic factors. For the unknown case, we prove that our algorithms achieve regret bounds of $\tilde{O}(\sqrt{KT} + KC^2)$ and $\tilde{O}(KC\sqrt{T})$ for the additive and multiplicative $C$ terms, respectively. In addition to these upper bounds, we provide several lower bounds showing the tightness of our bounds and the optimality of our algorithms. These results delineate an intrinsic separation between the bandits-with-attacks model and the corruption model [Lykouris et al., 2018].
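
The attack model and budget described above can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the names `budgeted_attack` and `run_wide_radius_elimination` are hypothetical, rewards are assumed Bernoulli, the budget is assumed to be charged by the absolute difference between actual and altered rewards, and the confidence radius (a Hoeffding term plus `budget / n`) is an assumed form of a "wide radius" for the known-$C$ case, since this summary does not state the exact radius used by SE-WR.

```python
import math
import random


def budgeted_attack(arm, reward, remaining, target_arm):
    """Hypothetical attacker: sees the pulled arm, then pushes the
    target arm's reward toward 0 while budget remains."""
    if arm != target_arm or remaining <= 0.0:
        return reward
    # The shift |reward - altered| can never exceed the remaining budget.
    return reward - min(reward, remaining)


def run_wide_radius_elimination(means, horizon, budget, seed=0):
    """Successive elimination with a radius widened by budget / n (assumed form)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    active = set(range(k))
    spent = 0.0
    regret = 0.0
    best_mean = max(means)
    target = means.index(best_mean)  # the attacker targets the optimal arm

    def radius(arm):
        n = counts[arm]
        # Hoeffding term (failure probability on the order of 1/horizon)
        # plus budget / n, which absorbs the worst-case shift of the mean.
        return math.sqrt(math.log(2 * k * horizon ** 2) / (2 * n)) + budget / n

    for _ in range(horizon):
        # Pull the least-sampled active arm (round-robin over the active set).
        arm = min(active, key=lambda a: (counts[a], a))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        altered = budgeted_attack(arm, reward, budget - spent, target)
        spent += abs(reward - altered)
        counts[arm] += 1
        sums[arm] += altered  # the learner only observes the altered reward
        regret += best_mean - means[arm]  # pseudo-regret w.r.t. true means

        if all(counts[a] > 0 for a in active):
            best_lcb = max(sums[a] / counts[a] - radius(a) for a in active)
            # Keep arms whose upper bound still reaches the best lower bound.
            active = {a for a in active
                      if sums[a] / counts[a] + radius(a) >= best_lcb}
    return regret, spent, active


regret, spent, active = run_wide_radius_elimination(
    [0.9, 0.5, 0.4], horizon=5000, budget=50.0)
print(f"pseudo-regret={regret:.1f}, spent={spent:.1f}, active={sorted(active)}")
```

The `budget / n` term is the design choice worth noting: an attacker who spends the whole budget on one arm can shift that arm's empirical mean by at most `budget / n` after `n` pulls, so widening the radius by this amount keeps the true mean inside the interval whenever the uncorrupted Hoeffding bound holds.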

Paper Structure (16 sections, 15 theorems, 36 equations, 2 figures, 1 table, 6 algorithms)

Introduction
Model: MAB with Adversarial Attacks
Lower Bounds
A General Lower Bound
Two Lower Bounds for Special Algorithm Classes
Algorithms with Known Attack Budget
SE-WR: An Algorithm with Gap-Dependent Upper Bound
SE-WR-Stop: An Algorithm with Gap-Independent Upper Bounds
Algorithms with Unknown Attack Budget
PE-WR: An Algorithm with Additive Upper Bound
MS-SE-WR: An Algorithm with Multiplicative Upper Bound
Other Model Details: Corruption Process, Regret Discussion
Proof for Lower Bound
Deferred Algorithm Pseudo-Code
Proof for Upper Bounds with Known Attack
...and 1 more section

Key Result

Theorem 1

Given a stochastic multi-armed bandit game with $K$ arms, under attack with budget $C$, and $T>KC$ decision rounds (note that $T\leqslant KC$ implies $C=\Omega(T)$, which trivially results in linear regret), for any bandit algorithm, there exists an attack policy with budget $C$ that can make t…

Figures (2)

Figure 1: Algorithm design overview: only SE-WR has a gap-dependent bound; others are all gap-independent.
Figure 2: Comparison of unknown $C$ regrets (see Remark \ref{remark:regret-bounds-comparison} for details)
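
As a rough guide to the comparison in Figure 2, the two unknown-$C$ bounds stated in the abstract can be compared by direct arithmetic. The worked comparison below is a sketch, not the paper's own remark: it drops the constants and logarithmic factors hidden by $\tilde{O}$, and the shorthand $A(C)$, $M(C)$ and the restriction to $C \geq 1$ are assumptions introduced here.

```latex
\begin{align*}
A(C) &= \sqrt{KT} + KC^2 \quad\text{(additive)}, &
M(C) &= KC\sqrt{T} \quad\text{(multiplicative)}, \\
\frac{A(C)}{M(C)} &= \frac{1}{\sqrt{K}\,C} + \frac{C}{\sqrt{T}}.
\end{align*}
% For 1 <= C <= sqrt(T), both terms of the ratio are at most 1, so A(C) <= 2 M(C).
% The ratio is minimized at C = (T/K)^{1/4}, where it equals 2 (KT)^{-1/4}.
% For C > sqrt(T), A(C) > M(C), but then M(C) > KT, so both bounds exceed
% the trivial linear bound for rewards in [0, 1].
```

Under these assumptions, the additive form is within a factor of 2 of the multiplicative form whenever $1 \leq C \leq \sqrt{T}$ and is smaller by a factor of order $(KT)^{1/4}$ near $C = (T/K)^{1/4}$.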

Theorems & Definitions (24)

Theorem 1
Proposition 2
Theorem 3: Adapted from [Zuo et al., 2024] and [He et al., 2022]
Proposition 4: Achievable additive regret bounds
Proposition 5: Achievable multiplicative regret bounds
Theorem 6
Theorem 7
Remark 8: Algorithm selection for known $C$ case
Theorem 9
Lemma 10
...and 14 more
