Multi-Player Approaches for Dueling Bandits

Or Raveh, Junya Honda, Masashi Sugiyama

TL;DR

This work addresses learning in a distributed dueling-bandit setting with multiple cooperating players and a Condorcet Winner. It introduces two complementary algorithms: a Follow Your Leader Black Box (FYLBB) that can leverage any single-player dueling-bandit base algorithm, and a fully distributed message-passing RUCB (MP-RUCB) that uses CW recommendations to accelerate exploration. The authors prove an asymptotic lower bound of order $O(K \log T)$ that is independent of the number of players $M$, and they show both algorithms achieve matching upper bounds, with fast non-asymptotic CW identification in the distributed setting. Experiments on real preference data demonstrate that multiplayer approaches outperform single-player baselines, highlighting the gains from cooperative exploration and CW-driven communication in noninvasive, preference-based feedback scenarios such as ranking or tuning large models.
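
The Condorcet Winner (CW) assumed above is the arm that beats every other arm with probability greater than 1/2. The following is a minimal sketch of that standard definition, not code from the paper: the function name `condorcet_winner`, the matrix `Q_example`, and the convention that `Q[i][j]` holds the probability that arm `i` beats arm `j` are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def condorcet_winner(Q: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the Condorcet Winner, or None if there is none.

    Assumption: Q[i][j] is the probability that arm i beats arm j,
    so Q[i][j] + Q[j][i] == 1 and the diagonal is 0.5.
    """
    K = len(Q)
    for i in range(K):
        # Arm i is the CW if it beats every other arm with probability > 1/2.
        if all(Q[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None


# Hypothetical 3-arm preference matrix in which arm 0 beats both other arms.
Q_example = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]
cw = condorcet_winner(Q_example)
```

A cyclic preference matrix (arm 0 beats 1, 1 beats 2, 2 beats 0) has no CW, in which case the sketch returns `None`; the setting summarized above assumes a CW exists.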

Abstract

Various approaches have emerged for multi-armed bandits in distributed systems. The multiplayer dueling bandit problem, common in scenarios with only preference-based information like human feedback, introduces challenges related to controlling collaborative exploration of non-informative arm pairs, but has received little attention. To fill this gap, we demonstrate that the direct use of a Follow Your Leader black-box approach matches the lower bound for this setting when utilizing known dueling bandit algorithms as a foundation. Additionally, we analyze a message-passing fully distributed approach with a novel Condorcet-winner recommendation protocol, resulting in expedited exploration in many cases. Our experimental comparisons reveal that our multiplayer algorithms surpass single-player benchmark algorithms, underscoring their efficacy in addressing the nuanced challenges of the multiplayer dueling bandit setting.
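
To make the notion of a "known dueling bandit algorithm" concrete, here is a rough sketch of how a single-player RUCB-style learner might pick one pair of arms to duel. It is a simplified illustration of the standard single-player idea (optimistic win-rate estimates, a candidate that looks unbeaten, and its strongest challenger), not the paper's FYLBB or MP-RUCB. The function name `rucb_select_pair`, the `wins` matrix layout, and the default `alpha` are assumptions, and the candidate bookkeeping of the full algorithm is omitted.

```python
import math
import random
from typing import List, Optional, Tuple


def rucb_select_pair(
    wins: List[List[int]],
    t: int,
    alpha: float = 0.51,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Pick (candidate, challenger) from pairwise win counts at round t >= 1.

    Assumption: wins[i][j] counts how often arm i has beaten arm j so far.
    """
    rng = rng or random.Random()
    K = len(wins)
    # u[i][j]: optimistic estimate of P(i beats j); the diagonal stays at 0.5.
    u = [[0.5] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            n = wins[i][j] + wins[j][i]
            if n == 0:
                u[i][j] = 1.0  # unexplored pairs are treated optimistically
            else:
                u[i][j] = wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
    # Candidates: arms that could still beat every other arm.
    candidates = [c for c in range(K) if all(u[c][j] >= 0.5 for j in range(K))]
    c = rng.choice(candidates) if candidates else rng.randrange(K)
    # Challenger: the arm with the highest optimistic chance of beating c.
    d = max((j for j in range(K) if j != c), key=lambda j: u[j][c])
    return c, d


# Hypothetical counts after 180 duels among 3 arms; arm 0 has won most of its duels.
wins_example = [
    [0, 50, 45],
    [10, 0, 30],
    [15, 30, 0],
]
pair = rucb_select_pair(wins_example, t=100)
```

In a multiplayer setting, the open design question is how players share such counts or recommendations; the abstract above describes a leader-based scheme and a message-passing scheme for doing so.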

Paper Structure (24 sections, 13 theorems, 99 equations, 14 figures, 3 algorithms)


Key Result

Theorem 3.1

For any consistent algorithm on $\mathcal{Q}_{\mathrm{CW}}$ and any $Q \in \mathcal{Q}_{\mathrm{CW}}$, the group regret obeys an asymptotic lower bound of order $K \log T$ that is independent of the number of players $M$.

Figures (14)

  • Figure 1: Six Rankers
  • Figure 2: Sushi
  • Figure 3: Irish
  • Figure 5: Message Passing RUCB
  • Figure 6: Follow Your Leader RUCB
  • ...and 9 more figures

Theorems & Definitions (31)

  • Theorem 3.1
  • Theorem 4.2
  • Corollary 4.3
  • Lemma 5.1
  • Theorem 5.2
  • Remark 5.3
  • Definition C.1
  • Definition C.2
  • Definition C.4
  • Definition C.5
  • ...and 21 more