
Combining Diverse Information for Coordinated Action: Stochastic Bandit Algorithms for Heterogeneous Agents

Lucia Gordon, Esther Rolf, Milind Tambe

TL;DR

This work introduces a UCB-style algorithm, Min-Width, which aggregates information from diverse agents, and addresses the joint challenges of aggregating the rewards, which follow different distributions for each agent-arm pair, and coordinating the assignments of agents to arms.

Abstract

Stochastic multi-agent multi-armed bandits typically assume that the rewards from each arm follow a fixed distribution, regardless of which agent pulls the arm. However, in many real-world settings, rewards can depend on the sensitivity of each agent to their environment. In medical screening, disease detection rates can vary by test type; in preference matching, rewards can depend on user preferences; and in environmental sensing, observation quality can vary across sensors. Since past work does not specify how to allocate agents of heterogeneous but known sensitivity of these types in a stochastic bandit setting, we introduce a UCB-style algorithm, Min-Width, which aggregates information from diverse agents. In doing so, we address the joint challenges of (i) aggregating the rewards, which follow different distributions for each agent-arm pair, and (ii) coordinating the assignments of agents to arms. Min-Width facilitates efficient collaboration among heterogeneous agents, exploiting the known structure in the agents' reward functions to weight their rewards accordingly. We analyze the regret of Min-Width and conduct pseudo-synthetic and fully synthetic experiments to study the performance of different levels of information sharing. Our results confirm that the gains to modeling agent heterogeneity tend to be greater when the sensitivities are more varied across agents, while combining more information does not always improve performance.
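To make the idea of weighting rewards by known sensitivity concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the Bernoulli reward model $Y_{i,a,n}\sim\text{Bern}(s_a\mu_n)$ that appears in Proposition 1, rescales each agent's empirical mean by $1/s_a$ to get an unbiased estimate of $\mu_n$, and combines agents with weights proportional to $s_a^2 c_{a,n}$. Those weights are the ones that minimize a plain Hoeffding confidence width for such a weighted average; whether they coincide with the Min-Width weights derived in the paper is an assumption. The function names (`simulate_pulls`, `hoeffding_weights`, `pooled_estimate`) and the confidence parameter `delta` are hypothetical.

```python
import math
import random


def simulate_pulls(s_a, mu_n, count, rng):
    """Draw `count` rewards from Bern(s_a * mu_n) (reward model from Proposition 1)."""
    p = s_a * mu_n
    return [1 if rng.random() < p else 0 for _ in range(count)]


def hoeffding_weights(sensitivities, counts):
    """Weights proportional to s_a^2 * c_a, normalised to sum to 1.

    Assumption: this minimises a Hoeffding-style width; it is not
    confirmed to be the weighting used by Min-Width.
    """
    raw = [s * s * c for s, c in zip(sensitivities, counts)]
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one agent must have pulled the arm")
    return [r / total for r in raw]


def pooled_estimate(sensitivities, rewards_per_agent, delta=0.05):
    """Return (estimate of mu_n, two-sided Hoeffding width at level delta)."""
    counts = [len(r) for r in rewards_per_agent]
    weights = hoeffding_weights(sensitivities, counts)
    estimate = 0.0
    sum_sq_ranges = 0.0
    for s, c, w, rewards in zip(sensitivities, counts, weights, rewards_per_agent):
        if c == 0:
            continue  # an agent with no pulls contributes nothing
        # Rescale the agent's empirical mean by 1/s_a so it is unbiased for mu_n.
        estimate += w * (sum(rewards) / c) / s
        # Each of the c rescaled samples has range w / (s * c).
        sum_sq_ranges += w * w / (s * s * c)
    width = math.sqrt(sum_sq_ranges * math.log(2 / delta) / 2)
    return estimate, width


if __name__ == "__main__":
    rng = random.Random(0)
    sens = [0.1, 0.9]  # illustrative sensitivities
    data = [simulate_pulls(s, 0.5, 200, rng) for s in sens]
    est, width = pooled_estimate(sens, data)
    print(f"estimate={est:.3f} width={width:.3f}")
```

Under this sketch, a low-sensitivity agent receives a small weight but still tightens the interval slightly, since the pooled width scales as $1/\sqrt{\sum_a s_a^2 c_{a,n}}$.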

Paper Structure (34 sections, 9 theorems, 149 equations, 6 figures, 3 tables, 1 algorithm)


Key Result

Proposition 1

(Min-Width Weights Derivation). Suppose agent $a$ pulls arm $n$ a fixed number of times, a number we denote $c_{a,n}$, where the reward from each pull is $Y_{i,a,n}\sim\text{Bern}(s_a\mu_n)$. Let $\mathcal{C}_n=\{c_{a,n}\}_{a=1}^A$ contain the $c_{a,n}$ for every agent. Let $D_{\mathcal{C}_n,n}$ be Then the weights $w_{\mathcal{C}_n,a,n}$ that minimize the width of the confidence interval on $\mu

Figures (6)

  • Figure 1: Regret plotted over time for the COVID test allocation (left) and hotel recommendation (right) domains.
  • Figure 2: Regret plotted over time for the poaching prevention domain with varying agent sensitivities.
  • Figure B1: Regret plotted over time for Min-Width (M-W), Min-UCB (M-UCB), No-Sharing (N-S), CUCB, and UCB averaged over 300 trials with $\mu=\{0.1,0.5\}$ and $\mathcal{S}=\{0.1,0.9\}$.
  • Figure B2: Regret plotted over time for Min-Width (M-W), Min-UCB (M-UCB), No-Sharing (N-S), CUCB, and UCB averaged over 500 trials with $\mu=\{0.05,0.1,0.12,0.15,0.25,0.3\}$, $\mathcal{S}=\{0.8,0.8,0.8,0.95,0.95\}$, $\tilde{\mathcal{S}}=\{0.85,0.85,0.85,0.98,0.98\}$.
  • Figure B3: Regret plotted over time for Min-Width (M-W), Min-UCB (M-UCB), No-Sharing (N-S), CUCB, and UCB averaged over 500 trials with $\mu=\{0.05,0.1,0.12,0.15,0.25,0.3\}$, $\mathcal{S}=\{0.8,0.8,0.8,0.95,0.95\}$, $\tilde{\mathcal{S}}=\{0.75,0.75,0.75,0.9,0.9\}$.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Proposition 2: No-Sharing Concentration Bound
  • proof
  • Corollary 1: Min-UCB Concentration Bound
  • proof
  • ...and 8 more