Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities

Hong Xie, Jinyu Mo, Defu Lian, Jie Wang, Enhong Chen

TL;DR

The paper studies the distributed allocation of stochastic requests across $M$ arms by $K$ players who cannot communicate. Requests arrive at each arm IID according to the distribution $\mathbf{P}$, and each served request yields a reward with mean $\boldsymbol{\mu}$. The paper defines an optimal arm-pulling profile $\bm{n}^*$ as one that maximizes the total expected reward $U(\bm{n},\mathbf{P},\boldsymbol{\mu})$. It develops a polynomial-time greedy algorithm with complexity $O(KM)$ that locates one optimal profile, together with an iterative commitment procedure through which players settle on the profile, when it is unique, in a constant number of rounds in expectation. For the online setting with unknown parameters, it adopts an explore-then-commit (ETC) framework: an exploration phase estimates $\mathbf{P}$ and $\boldsymbol{\mu}$, a consensus mechanism then ensures that all players agree on an optimal profile within $M$ rounds, and finally the commit algorithm is applied. The theoretical analysis shows logarithmic regret under an appropriate exploration length and parameter choice, and experiments support the efficiency and convergence of the approach. The work targets distributed decision-making in ridesharing-like applications, where arrivals and rewards are stochastic and players must coordinate without communication.

Abstract

Motivated by distributed selection problems, we formulate a new variant of the multi-player multi-armed bandit (MAB) model, which captures the stochastic arrival of requests to each arm as well as the policy for allocating requests to players. The challenge is to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile (an arm pulling profile prescribes the number of players at each arm) without communicating with each other. We first design a greedy algorithm that locates one of the optimal arm pulling profiles with polynomial computational complexity. We also design an iterative distributed algorithm for players to commit to an optimal arm pulling profile in a constant number of rounds in expectation. We apply the explore-then-commit (ETC) framework to address the online setting, in which model parameters are unknown. We design an exploration strategy for players to estimate the optimal arm pulling profile. Since these estimates can differ across players, it is challenging for players to commit. We therefore design an iterative distributed algorithm that guarantees players reach a consensus on the optimal arm pulling profile in only $M$ rounds. We conduct experiments to validate our algorithm.
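
To make the greedy step concrete, here is a minimal Python sketch of one way such a procedure could look. It is not the authors' algorithm. It assumes (this form is not stated on this page) that the total expected reward is $U(\bm{n},\mathbf{P},\boldsymbol{\mu}) = \sum_m \mu_m \, \mathbb{E}[\min(n_m, D_m)]$, where $D_m$ is the number of requests arriving at arm $m$, so that adding one player to arm $m$ gains $\mu_m \Pr(D_m > n_m)$. The names `tail_probs` and `greedy_profile` and the list-of-pmfs input format are hypothetical.

```python
from typing import List, Sequence


def tail_probs(pmf: Sequence[float], K: int) -> List[float]:
    """Return [P(D > 0), ..., P(D > K-1)] for a pmf over {0, ..., len(pmf)-1}."""
    tails = [0.0] * K
    acc = 0.0
    # Accumulate probability mass from the largest demand value downwards.
    for d in range(len(pmf) - 1, 0, -1):
        acc += pmf[d]
        if d - 1 < K:
            tails[d - 1] = acc  # acc == P(D >= d) == P(D > d - 1)
    return tails


def greedy_profile(K: int, pmfs: List[Sequence[float]], mu: Sequence[float]) -> List[int]:
    """Place K players one at a time on the arm with the largest marginal gain.

    Assumption (hypothetical reward model): the marginal gain of adding a
    player to arm m that already holds n[m] players is mu[m] * P(D_m > n[m]).
    """
    M = len(mu)
    tails = [tail_probs(pmfs[m], K) for m in range(M)]  # precomputed once
    n = [0] * M
    for _ in range(K):  # K placements, each scanning M arms: O(KM) overall
        best = max(range(M), key=lambda m: mu[m] * tails[m][n[m]])
        n[best] += 1
    return n


if __name__ == "__main__":
    # Two arms, three players; demand pmfs and mean rewards are made-up numbers.
    profile = greedy_profile(3, [[0.2, 0.5, 0.3], [0.5, 0.5]], [1.0, 0.8])
    print(profile)  # [2, 1]
```

Under the assumed reward model the marginal gain of an arm never increases as more players join it, which is why placing players one at a time can suffice. The main loop performs $K$ placements over $M$ arms, matching the stated $O(KM)$ complexity once the tail probabilities are precomputed.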
