Multiplayer Information Asymmetric Contextual Bandits

William Chang, Yuanhao Lu

TL;DR

This work extends contextual bandits to a multiplayer setting with information asymmetry, where joint actions and shared or private rewards create new coordination challenges. It adapts the LinUCB framework to multiple agents through LinUCB-A and LinUCB-B and introduces an explore-then-commit (ETC) strategy for the fully asymmetric case, achieving $O(\sqrt{T})$ regret in the respective scenarios. The results demonstrate that, with appropriate coordination and parameter tuning, sublinear regret is attainable even under action or reward asymmetry, and also under both asymmetries via ETC. The findings broaden the scope of contextual bandits to distributed, partially observable decision-making with potential applications in multi-agent systems and distributed decision platforms.

Abstract

Single-player contextual bandits are a well-studied problem in reinforcement learning that has seen applications in various fields such as advertising, healthcare, and finance. In light of the recent work on \emph{information asymmetric} bandits \cite{chang2022online, chang2023online}, we propose a novel multiplayer information asymmetric contextual bandit framework in which there are multiple players, each with their own set of actions. At every round, they observe the same context vectors and simultaneously take an action from their own set of actions, giving rise to a joint action. However, upon taking this action, the players are subject to information asymmetry in (1) actions and/or (2) rewards. We design an algorithm \texttt{LinUCB} by modifying the classical single-player algorithm \texttt{LinUCB} in \cite{chu2011contextual} to achieve the optimal regret $O(\sqrt{T})$ when only one kind of asymmetry is present. We then propose a novel algorithm \texttt{ETC} that is built on explore-then-commit principles to achieve the same optimal regret when both types of asymmetry are present.
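To make the setting concrete, the sketch below runs a standard single-parameter LinUCB index (in the style of \cite{chu2011contextual}) over joint actions, with one independent copy per player. It is an illustration under stated assumptions, not the paper's LinUCB-A or LinUCB-B. It assumes two players with two actions each, a shared linear reward model with a hidden parameter `theta_star`, Gaussian noise, and a fixed deterministic tie-breaking order. The names `JointLinUCB`, `simulate`, `alpha`, and `reg` are hypothetical. Each player plays only its own component of the joint action it selects and never observes the other player's action.

```python
import itertools
import numpy as np


class JointLinUCB:
    """LinUCB with one shared parameter vector, run over joint actions."""

    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha            # exploration width (assumed value)
        self.A = reg * np.eye(dim)    # regularised Gram matrix
        self.b = np.zeros(dim)        # reward-weighted feature sum

    def select(self, contexts):
        """contexts maps joint action -> feature vector; returns the argmax UCB."""
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b    # ridge-regression estimate
        best, best_ucb = None, -np.inf
        # Sorted iteration gives every player the same tie-breaking rule.
        for joint in sorted(contexts):
            x = contexts[joint]
            ucb = x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
            if ucb > best_ucb:
                best, best_ucb = joint, ucb
        return best

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


def simulate(rounds=200, seed=0, dim=4, noise=0.1):
    """Two players, shared contexts and rewards, unobserved partner actions."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=dim)
    theta_star /= np.linalg.norm(theta_star)   # hidden parameter (assumption)
    joint_actions = list(itertools.product(range(2), range(2)))
    players = [JointLinUCB(dim), JointLinUCB(dim)]
    picks = []
    for _ in range(rounds):
        # Every player sees the same context vector for each joint action.
        contexts = {}
        for joint in joint_actions:
            x = rng.normal(size=dim)
            contexts[joint] = x / np.linalg.norm(x)
        chosen = [p.select(contexts) for p in players]
        # Player i contributes only component i of its own selection.
        played = (chosen[0][0], chosen[1][1])
        reward = float(contexts[played] @ theta_star + noise * rng.normal())
        # Each player updates with the joint action it believes was played.
        for p, c in zip(players, chosen):
            p.update(contexts[c], reward)
        picks.append(tuple(chosen))
    return picks


if __name__ == "__main__":
    picks = simulate()
    print(all(a == b for a, b in picks))
```

Because both copies receive identical contexts and rewards and break ties identically, their selections coincide in every round, so each player's belief about the joint action matches what was actually played. The sketch does not cover the case where rewards are private.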

Paper Structure

This paper contains 9 sections, 1 theorem, 13 equations, 2 algorithms.

Key Result

Theorem 2

In the action asymmetric (Problem A) contextual bandit setting where the context vectors, the frequentist regret bound of Algorithm LinUCB-A is

Theorems & Definitions (3)

  • Definition 1
  • Theorem 2
  • proof