Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks

Akshayaa Magesh, Venugopal V. Veeravalli

TL;DR

The paper tackles robust decision-making in a decentralized, heterogeneous multi-player multi-armed bandit setting under adversarial attacks that can nullify rewards. It introduces an epoch-based policy with Exploration, Matching, and Exploitation phases and leverages one-bit communication rounds to maintain coordination without full observability. The core innovation combines a payoff-based matching phase with a perturbed Markov chain analysis to ensure convergence to an efficient action profile, yielding a regret bound of $R(T) = O(\log^{1+\delta} T + W)$. The results demonstrate sublinear regret growth despite adversarial disruptions, highlighting practical impact for dynamic spectrum access and other distributed resource allocation tasks under adversarial conditions.

Abstract

We consider a multi-player multi-armed bandit setting in the presence of adversaries that attempt to negatively affect the rewards received by the players in the system. The reward distributions for any given arm are heterogeneous across the players. In the event of a collision (more than one player choosing the same arm), all the colliding players receive zero reward. The adversaries use collisions to affect the rewards received by the players, i.e., if an adversary attacks an arm, any player choosing that arm will receive zero reward. At any time step, the adversaries may attack more than one arm. It is assumed that the players in the system do not deviate from a pre-determined policy used by all the players, and that the probability that none of the arms face adversarial attacks is strictly positive at every time step. In order to combat the adversarial attacks, the players are allowed to communicate using a single bit for $O(\log T)$ time units, where $T$ is the time horizon, and each player can only observe their own actions and rewards at all time steps. We propose a policy that is used by all the players, which achieves near-order-optimal regret of order $O(\log^{1+\delta} T + W)$, where $W$ is the total number of time units for which there was an adversarial attack on at least one arm.
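The reward rule described in the abstract (zero reward on a collision or on an attacked arm, player-specific rewards otherwise) can be illustrated with a minimal sketch. This is not the paper's policy or code: the function names `step_rewards` and `count_attacked_steps`, the `MEAN_REWARD` table, and the choice to return the mean instead of a random draw are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set


def step_rewards(
    choices: List[int],
    attacked_arms: Set[int],
    sample_reward: Callable[[int, int], float],
) -> List[float]:
    """Rewards for one time step; choices[i] is the arm picked by player i."""
    # Count how many players picked each arm in this time step.
    counts: Dict[int, int] = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1

    rewards = []
    for player, arm in enumerate(choices):
        if counts[arm] > 1 or arm in attacked_arms:
            # A collision or an adversarial attack yields zero reward.
            rewards.append(0.0)
        else:
            # Heterogeneous rewards: the value depends on both player and arm.
            rewards.append(sample_reward(player, arm))
    return rewards


def count_attacked_steps(attack_schedule: List[Set[int]]) -> int:
    """W: the number of time steps in which at least one arm was attacked."""
    return sum(1 for attacked in attack_schedule if attacked)


# Hypothetical mean rewards for 2 players and 3 arms (illustrative values only).
MEAN_REWARD = [[0.9, 0.2, 0.5], [0.3, 0.8, 0.4]]


def mean_as_reward(player: int, arm: int) -> float:
    # Assumption: return the mean instead of a random draw, for determinism.
    return MEAN_REWARD[player][arm]
```

For example, `step_rewards([0, 1], {1}, mean_as_reward)` gives player 0 its usual reward on arm 0 and gives player 1 zero reward, because arm 1 is under attack. From a single zero reward a player cannot tell a collision apart from an attack, since each player observes only its own actions and rewards.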

Paper Structure

This paper contains 13 sections, 11 theorems, 44 equations, 1 figure, 3 algorithms.

Key Result

Theorem 1

Given the paper's system model, the expected regret of the proposed algorithm for a time horizon $T$ and some $0 < \delta < 1$ is $R(T) = O(\log^{1+\delta} T + W)$, where $W$ is the total number of time units for which there was an adversarial attack on at least one arm.
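As a quick check of what this bound implies (a standard calculation, not taken from the paper), divide by $T$. The constant $C$ below stands for the one hidden in the $O(\cdot)$ notation and is an assumption of this sketch:

$$
\frac{R(T)}{T} \;\le\; C\left(\frac{\log^{1+\delta} T}{T} + \frac{W}{T}\right),
\qquad
\lim_{T \to \infty} \frac{\log^{1+\delta} T}{T} = 0 .
$$

The first term vanishes for any fixed $\delta$, so the per-step regret tends to zero whenever the number of attacked time units satisfies $W = o(T)$. If attacks occur on a constant fraction of time steps, the $W$ term dominates and the bound grows linearly in $T$.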

Figures (1)

  • Figure 1: Average accumulated regret as a function of time

Theorems & Definitions (19)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • ...and 9 more