Heterogeneous Multi-Agent Bandits with Parsimonious Hints

Amirmahdi Mirfakhar, Xuchuang Wang, Jinhang Zuo, Yair Zick, Mohammad Hajiesmaili

TL;DR

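This paper studies hinted heterogeneous multi-agent bandits (HMA2B) and proposes centralized and decentralized algorithms that achieve time-independent regret with a logarithmic number of hints. Lower bounds establish the optimality of these results, and numerical simulations verify them.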

Abstract

We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, which is an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.
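The setting in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the names `utility`, `best_matching`, and `min_matching_gap` are hypothetical, the mean-reward matrix `mu` is made up, and it assumes that colliding agents earn zero reward and that $M \le K$. It also assumes one plausible reading of the "minimum gap", namely the difference between the best and second-best collision-free matching utilities. The brute-force search is only meant for tiny instances.

```python
from itertools import permutations


def utility(mu, matching):
    """Expected total utility when agent m pulls arm matching[m].

    Assumption: an agent whose arm is also pulled by another agent
    collides and contributes zero reward.
    """
    total = 0.0
    for m, k in enumerate(matching):
        if matching.count(k) == 1:  # arm k is pulled by agent m only
            total += mu[m][k]
    return total


def best_matching(mu):
    """Brute-force the collision-free assignment with the highest utility."""
    num_agents, num_arms = len(mu), len(mu[0])
    return max(permutations(range(num_arms), num_agents),
               key=lambda g: utility(mu, g))


def min_matching_gap(mu):
    """Gap between the best and second-best distinct matching utilities
    (a hypothetical stand-in for the paper's minimum gap)."""
    num_agents, num_arms = len(mu), len(mu[0])
    values = sorted(
        {round(utility(mu, g), 12)
         for g in permutations(range(num_arms), num_agents)},
        reverse=True,
    )
    return values[0] - values[1] if len(values) > 1 else 0.0


# Made-up instance: M = 2 agents (rows), K = 3 arms (columns).
mu = [[0.9, 0.2, 0.1],
      [0.8, 0.7, 0.3]]
g_star = best_matching(mu)
print(g_star, utility(mu, g_star), min_matching_gap(mu))
```

In this instance both agents prefer arm 0, so the heterogeneity matters: pulling it together yields nothing under the collision assumption, and the best assignment sends the second agent to its second-favorite arm.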

Paper Structure

This paper contains 48 sections, 32 theorems, 93 equations, 5 figures, 1 table, 8 algorithms.

Key Result

Theorem 1

For $0 < \delta < \frac{\Delta^\text{match}_{\min}}{2}$ and policy $\pi = \texttt{HCLA}$, the policy $\pi$ has [...], where $\Delta^{\mathrm{kl}} = \mathrm{kl}(U(G^*;\bm{\mu}) - \Delta^\text{match}_{\min} + \delta, U(G^*;\bm{\mu}) - \delta)$.

Figures (5)

  • Figure 1: Set of covering matchings $\mathcal{R}$ for $M=3$ and $K=4$: $R_1$, $R_2$, $R_3$ and $R_4$ are depicted in (a), (b), (c) and (d).
  • Figure 2: Figure \ref{fig:exp} plots $R^\pi(T)$ and $R^{\pi_{\text{exp}}}(T)$ for both centralized and decentralized setups. Figures \ref{fig:chints} and \ref{fig:dhints} show $L^\pi(T)$ for the centralized and decentralized algorithms, respectively, and Figure \ref{fig:dcom} shows $R^{\pi_{\text{com}}}(T)$ for the decentralized algorithms.
  • Figure 3: Figure \ref{fig1:chints} demonstrates that $\texttt{GP-HCLA}$ outperforms $\texttt{G-HCLA}$ in terms of $L^\pi(T)$.
  • Figure 4: Figures \ref{fig2:chints} and \ref{fig2:exp} illustrate the inefficiency of $\texttt{HCLA}$ when the instance size increases slightly.
  • Figure 5: An $i$-cycle contained in $PG_{G^*}$

Theorems & Definitions (58)

  • Theorem 1
  • Theorem 2
  • Theorem 3: Necessity of Communication
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Definition 1
  • Lemma 1: wang2020optimal
  • Theorem 6
  • proof
  • ...and 48 more