Multi-Player Approaches for Dueling Bandits
Or Raveh, Junya Honda, Masashi Sugiyama
TL;DR
This work addresses learning in a distributed dueling-bandit setting with multiple cooperating players and a Condorcet Winner (CW). It introduces two complementary algorithms: a Follow Your Leader Black Box (FYLBB) that can leverage any single-player dueling-bandit base algorithm, and a fully distributed message-passing RUCB (MP-RUCB) that uses CW recommendations to accelerate exploration. The authors prove an asymptotic lower bound of order $O(K \log T)$ that is independent of the number of players $M$, and they show both algorithms achieve matching upper bounds, with fast non-asymptotic CW identification in the distributed setting. Experiments on real preference data demonstrate that multiplayer approaches outperform single-player baselines, highlighting the gains from cooperative exploration and CW-driven communication in noninvasive, preference-based feedback scenarios such as ranking or tuning large models.
Abstract
Various approaches have emerged for multi-armed bandits in distributed systems. The multiplayer dueling bandit problem, common in scenarios with only preference-based information like human feedback, introduces challenges related to controlling collaborative exploration of non-informative arm pairs, but has received little attention. To fill this gap, we demonstrate that the direct use of a Follow Your Leader black-box approach matches the lower bound for this setting when utilizing known dueling bandit algorithms as a foundation. Additionally, we analyze a message-passing fully distributed approach with a novel Condorcet-winner recommendation protocol, resulting in expedited exploration in many cases. Our experimental comparisons reveal that our multiplayer algorithms surpass single-player benchmark algorithms, underscoring their efficacy in addressing the nuanced challenges of the multiplayer dueling bandit setting.
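As a point of reference for the Condorcet-winner assumption, the short Python sketch below checks whether a preference matrix has a Condorcet winner, that is, an arm that beats every other arm with probability above 1/2. It is only an illustration of the definition, not of the algorithms analyzed in this work: the function name `condorcet_winner` and the matrices `P` and `P_cyclic` are hypothetical, and the sketch assumes the pairwise probabilities are known, whereas a bandit learner must estimate them from duels.

```python
from typing import Optional, Sequence


def condorcet_winner(pref: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the Condorcet winner, or None if none exists.

    pref[i][j] is the (assumed known) probability that arm i beats arm j.
    """
    k = len(pref)
    for i in range(k):
        # Arm i is a Condorcet winner if it beats every other arm
        # with probability strictly greater than 1/2.
        if all(pref[i][j] > 0.5 for j in range(k) if j != i):
            return i
    return None


# Hypothetical 3-arm preference matrix in which arm 0 beats both other arms.
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]

# Hypothetical cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0): no winner.
P_cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]

winner = condorcet_winner(P)
no_winner = condorcet_winner(P_cyclic)
```

The cyclic matrix shows why the existence of a Condorcet winner is an assumption rather than a given: pairwise preferences need not be transitive.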
