Efficient and Optimal Policy Gradient Algorithm for Corrupted Multi-armed Bandits

Jiayuan Liu, Siwei Wang, Zhixuan Fang

TL;DR

This work tackles stochastic multi-armed bandits under adversarial corruptions by applying SAMBA, a combinatorial policy-gradient algorithm. The authors prove a regret bound of $O\left(\frac{K}{\Delta}\log T + \frac{C}{\Delta}\right)$, which is asymptotically optimal and linear in the corruption level $C$, while the algorithm remains computationally efficient. The analysis uses SAMBA's Markovian policy to bound both the impact of a corruption and the time needed to recover from it, treating separately the case where the optimal arm is the leading arm and the case where it is not. Empirical results show that SAMBA outperforms baselines, including fast-slow AAE and BARBAR-type methods, and runs significantly faster than non-combinatorial online learning approaches. The findings position SAMBA as a state-of-the-art, scalable solution for corrupted bandits, with practical implications for robust online decision-making.

Abstract

In this paper, we consider the stochastic multi-armed bandit problem with adversarial corruptions, where an adversary partially modifies the random rewards of the arms to fool the algorithm. We apply the policy gradient algorithm SAMBA to this setting and show that it is computationally efficient and achieves a state-of-the-art $O(K\log T/\Delta) + O(C/\Delta)$ regret upper bound, where $K$ is the number of arms, $C$ is the unknown corruption level, $\Delta$ is the minimum expected reward gap between the best arm and the others, and $T$ is the time horizon. Compared with the best existing efficient algorithm (e.g., CBARBAR), whose regret upper bound is $O(K\log^2 T/\Delta) + O(C)$, SAMBA removes one $\log T$ factor from the regret bound while keeping the corruption-dependent term linear in $C$. This is asymptotically optimal. We also conduct simulations, and the results show that SAMBA outperforms existing baselines.
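To make the comparison between the two bounds concrete, the following sketch evaluates the leading terms $K\log T/\Delta + C/\Delta$ and $K\log^2 T/\Delta + C$ for a few corruption levels. It is only an illustration: the big-O constants are unknown and are set to 1 here as an assumption, the function names `samba_bound` and `cbarbar_bound` are hypothetical, and the parameter values are arbitrary choices rather than settings from the paper.

```python
import math


def samba_bound(K, T, delta, C):
    """Leading terms of the SAMBA bound, K*log(T)/delta + C/delta.

    Assumption: every hidden big-O constant is taken to be 1.
    """
    return K * math.log(T) / delta + C / delta


def cbarbar_bound(K, T, delta, C):
    """Leading terms of the CBARBAR bound, K*log(T)^2/delta + C.

    Assumption: every hidden big-O constant is taken to be 1.
    """
    return K * math.log(T) ** 2 / delta + C


# Arbitrary illustrative parameters (not taken from the paper).
K, T, delta = 10, 10**6, 0.1
for C in (0, 500, 5000):
    print(
        f"C={C:>5}  "
        f"SAMBA-style: {samba_bound(K, T, delta, C):10.1f}  "
        f"CBARBAR-style: {cbarbar_bound(K, T, delta, C):10.1f}"
    )
```

With $C = 0$ the two expressions differ by exactly a factor of $\log T$, which is the saving described in the abstract. Because the corruption terms are $C/\Delta$ and $C$ respectively, the unit-constant comparison can tip the other way once $C$ is large and $\Delta$ is small, so these numbers should be read as a rough guide to scaling and not as a prediction of measured regret.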

Paper Structure

This paper contains 20 sections, 7 theorems, 22 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2

If the constant $\alpha<\frac{\Delta}{r^*-\Delta}$, then the SAMBA algorithm for the multi-armed bandit problem with adversarial corruption level $C$ ensures a regret

Figures (5)

  • Figure 1: Recovery process.
  • Figure 2: Consecutive corruptions.
  • Figure 3: Comparison of different algorithms: the cumulative regrets under different corruption levels and different corruption schemes. SAMBA achieves the lowest cumulative regret in most settings, particularly outperforming baselines when $C=0$, demonstrating its $O(\log T)$ regret versus $O(\log^2 T)$ for others. However, as corruption $C$ increases, SAMBA's advantage diminishes, consistent with its regret bound of $O(C + \log T)$, while OMD shows worse performance due to its high complexity and large constant factors.
  • Figure 4: Comparison of different algorithms: cumulative regret over time when $C=2000$ under corruption schemes 3 and 4.
  • Figure 5: An illustration of an embedded chain.

Theorems & Definitions (11)

  • Theorem 2
  • Definition 3: Recovery process
  • Remark 1
  • Definition 4: Recovery process
  • Definition 5: Embedded Chain
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • ...and 1 more