Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach
Hao Qiu, Mengxiao Zhang, Nicolò Cesa-Bianchi
TL;DR
This work addresses distributed adversarial bandits with gossip communication, establishing near-optimal minimax regret bounds that separately capture communication and bandit information costs. The authors introduce a block-based learning scheme that decouples learning from communication via a black-box reduction to delayed feedback, enabling the transfer of delayed-feedback guarantees to the distributed setting. They prove a minimax regret of $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+K/N)T})$ for $K$-armed bandits and extend the approach to distributed linear bandits with regret $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+1/N)dT})$, all achievable with gossip-only communication. The framework is complemented by adaptive bounds (small-loss and best-of-both-worlds) and a matching lower bound that decomposes the difficulty into a communication term $\rho^{-1/4}\sqrt{T}$ and a bandit term $\sqrt{KT/N}$. A further extension to high-dimensional linear bandits uses a volumetric spanner to maintain $O(d)$ communication per round, with a corresponding lower bound, highlighting the near-optimal trade-offs between network topology, exploration, and information exchange in distributed adversarial learning.
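To make the role of the spectral gap concrete, the following is a minimal sketch of plain gossip averaging. It is not the paper's algorithm. The ring topology, the uniform 1/3 weights, the helper names, and the definition of the gap as one minus the second-largest singular value of the communication matrix are all assumptions made for illustration.

```python
import numpy as np


def ring_gossip_matrix(n: int) -> np.ndarray:
    """Symmetric doubly stochastic matrix for a ring of n agents.

    Assumed weighting: each agent puts weight 1/3 on itself and on each
    of its two ring neighbours.
    """
    W = np.zeros((n, n))
    for i in range(n):
        # += keeps rows summing to 1 even when neighbours coincide (n <= 2).
        W[i, i] += 1 / 3
        W[i, (i - 1) % n] += 1 / 3
        W[i, (i + 1) % n] += 1 / 3
    return W


def spectral_gap(W: np.ndarray) -> float:
    """Assumed definition: 1 minus the second-largest singular value of W."""
    singular_values = np.linalg.svd(W, compute_uv=False)  # sorted descending
    return 1.0 - singular_values[1]


def gossip(values, W: np.ndarray, rounds: int) -> np.ndarray:
    """Run `rounds` gossip steps x <- W x; each agent mixes with neighbours only."""
    x = np.asarray(values, dtype=float)
    for _ in range(rounds):
        x = W @ x
    return x


if __name__ == "__main__":
    W = ring_gossip_matrix(8)
    local_values = np.arange(8, dtype=float)  # hypothetical local quantities
    print("spectral gap:", spectral_gap(W))
    print("after 50 rounds:", gossip(local_values, W, 50))
    print("true average:", local_values.mean())
```

For a symmetric doubly stochastic matrix, the distance of the agents' values from the global average shrinks by a factor of at most one minus the gap per round, so a smaller gap means more rounds are needed before local estimates agree. This is the intuition behind a communication cost that grows as the gap shrinks.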
Abstract
We study distributed adversarial bandits, where $N$ agents cooperate to minimize the global average loss while observing only their own local losses. We show that the minimax regret for this problem is $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+K/N)T})$, where $T$ is the horizon, $K$ is the number of actions, and $\rho$ is the spectral gap of the communication matrix. Our algorithm, based on a novel black-box reduction to bandits with delayed feedback, requires agents to communicate only through gossip. It achieves an upper bound that significantly improves over the previous best bound $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$ of Yi and Vojnovic (2023). We complement this result with a matching lower bound, showing that the problem's difficulty decomposes into a communication cost $\rho^{-1/4}\sqrt{T}$ and a bandit cost $\sqrt{KT/N}$. We further demonstrate the versatility of our approach by deriving first-order and best-of-both-worlds bounds in the distributed adversarial setting. Finally, we extend our framework to distributed linear bandits in $\mathbb{R}^d$, obtaining a regret bound of $\tilde{O}(\sqrt{(\rho^{-1/2}+1/N)dT})$, achieved with only $O(d)$ communication cost per agent and per round via a volumetric spanner.
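The decomposition into a communication cost and a bandit cost can be read off the minimax rate with an elementary inequality. As a short worked step (constants are not tracked in the source, and logarithmic factors are ignored here): for any $a, b \ge 0$ we have $\max\{\sqrt{a}, \sqrt{b}\} \le \sqrt{a+b} \le \sqrt{a} + \sqrt{b}$. Taking $a = \rho^{-1/2} T$ and $b = KT/N$ gives

$$
\max\left\{\rho^{-1/4}\sqrt{T},\ \sqrt{KT/N}\right\}
\;\le\; \sqrt{\left(\rho^{-1/2} + K/N\right) T}
\;\le\; \rho^{-1/4}\sqrt{T} + \sqrt{KT/N}.
$$

Since the right-hand side is at most twice the left-hand side, the rate $\sqrt{(\rho^{-1/2}+K/N)T}$ matches the sum of the two costs up to a constant factor.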
