Bandit Social Learning with Exploration Episodes

Kiarash Banihashem, Natalie Collina, Aleksandrs Slivkins

TL;DR

The paper analyzes Bandit Social Learning with Exploration Episodes (EpiBSL), a model in which self-interested agents control an episode of length $m\ge 2$ within a two-arm Bayesian bandit with a skip option and a per-round exploration cost $c_{\text{expl}}$. Each episode yields a score via a symmetric, non-decreasing aggregation function $f:\{0,1\}^m\to[0,m]$, and agents select Bayesian-optimal per-episode policies based on Beta-Bernoulli posteriors. The authors show that, for any fixed $m\ge 2$ and symmetric $f$, there exist problem-dependent constants such that, when $c_{\text{expl}}$ is sufficiently small, a learning-failure event $\mathtt{FAIL}_{c_{\mathcal{P}},N_{\mathcal{P}}}$ occurs with positive probability, and this implies linear Bayesian regret: $\mathsf{BReg}(T) \ge c_{0}(T-N_{0})$ for large $T$. They provide explicit utility-gap bounds for common aggregation functions (e.g., $f=\min$, $f=\max$) and extend results to the $m=2$ non-symmetric case, establishing that endogenous within-episode exploration cannot suffice to guarantee sublinear regret. The work highlights the need for external interventions or incentive schemes to sustain exploration in social-learning settings and connects Bayesian bandit theory with strategic exploration in episodic, self-interested environments.
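The two building blocks named in the summary, a Beta-Bernoulli posterior per arm and an aggregation function $f$ that turns an episode's per-round outcomes into a score, can be sketched as follows. This is only an illustrative sketch: the class `BetaPosterior`, the helper `episode_score`, and the uniform Beta(1, 1) prior are assumptions made here, not definitions taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BetaPosterior:
    """Beta-Bernoulli posterior for one arm (hypothetical helper)."""
    alpha: float = 1.0  # pseudo-count of observed 1s; Beta(1, 1) prior is an assumption
    beta: float = 1.0   # pseudo-count of observed 0s

    def update(self, reward: int) -> None:
        # Conjugate update: a Bernoulli outcome increments one pseudo-count.
        if reward == 1:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior mean of the arm's success probability.
        return self.alpha / (self.alpha + self.beta)


def episode_score(outcomes: Sequence[int],
                  f: Callable[[Sequence[int]], int] = sum) -> int:
    """Aggregate the 0/1 outcomes of one episode with f (e.g. sum, min, max)."""
    return f(outcomes)


posterior = BetaPosterior()
for r in (1, 1, 0):
    posterior.update(r)

outcomes = [1, 0, 1]
scores = {name: episode_score(outcomes, f)
          for name, f in (("sum", sum), ("min", min), ("max", max))}
```

All three choices of `f` shown here are symmetric and non-decreasing in the outcomes and map an episode of length $m$ into $[0,m]$, matching the conditions stated above.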

Abstract

We study a stylized social learning dynamics where self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an "episode": a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that the aggregate exploration fails: e.g., its Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is some fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.
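To get a feel for how aggregate exploration can fail, the following toy simulation may help. It is a deliberately simplified stand-in, not the paper's model: agents here are myopic (each round they pull the arm with the higher posterior mean) rather than Bayesian-optimal over the episode, there is no skip option and no exploration cost, and the "failure" proxy (the better arm receives fewer than half of all pulls) is an assumption made here rather than the paper's $\mathtt{FAIL}$ event. Because the agents are myopic, the episode length `m` only organizes the loop. The function names and the Beta(1, 1) priors are hypothetical.

```python
import random


def simulate_greedy(mu, m, num_episodes, rng):
    """Run myopic agents on a two-arm Bernoulli bandit; return pulls per arm."""
    # [alpha, beta] pseudo-counts per arm; Beta(1, 1) priors are an assumption.
    post = [[1, 1], [1, 1]]
    pulls = [0, 0]
    for _ in range(num_episodes):      # one agent per episode
        for _ in range(m):             # m consecutive rounds within the episode
            means = [a / (a + b) for a, b in post]
            arm = 0 if means[0] >= means[1] else 1   # ties go to arm 0
            reward = 1 if rng.random() < mu[arm] else 0
            post[arm][0 if reward else 1] += 1       # conjugate posterior update
            pulls[arm] += 1
    return pulls


def failure_rate(mu, m, num_episodes, runs, seed=0):
    """Fraction of runs in which the better arm gets fewer than half of the pulls."""
    rng = random.Random(seed)
    best = max(range(2), key=lambda a: mu[a])
    fails = 0
    for _ in range(runs):
        pulls = simulate_greedy(mu, m, num_episodes, rng)
        if pulls[best] < pulls[1 - best]:
            fails += 1
    return fails / runs


rate = failure_rate(mu=(0.6, 0.8), m=2, num_episodes=50, runs=200)
print(f"empirical failure rate (toy proxy): {rate:.2f}")
```

In this toy setting, a run in which arm 0 looks good enough early on may never try arm 1 at all, so every later round incurs the gap between the two means. The paper's result is stronger: it concerns agents who do explore optimally within their own episodes.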

Paper Structure

This paper contains 31 sections, 23 theorems, 125 equations.

Key Result

Lemma 3.4

Fix $c\in(0,1/2)$ and $N\in\mathbb{N}$. Consider an $\mathtt{EpiBSL}$ instance such that $\delta:=\Pr\left[\mathtt{FAIL}_{c,N}\right]>0$. Then for some $N'\in\mathbb{N}$ determined by $c,N,\delta$ and length $m$.

Theorems & Definitions (60)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Lemma 3.4
  • Proof Sketch
  • Definition 3.5
  • Theorem 3.6
  • Remark 3.7
  • Lemma 3.8
  • Proof Sketch
  • ...and 50 more