Bandit Social Learning with Exploration Episodes

Kiarash Banihashem; Natalie Collina; Aleksandrs Slivkins

TL;DR

The paper analyzes Bandit Social Learning with Exploration Episodes (EpiBSL), a model in which self-interested agents control an episode of length $m\ge 2$ within a two-arm Bayesian bandit with a skip option and a per-round exploration cost $c_{\text{expl}}$. Each episode yields a score via a symmetric, non-decreasing aggregation function $f:\{0,1\}^m\to[0,m]$, and agents select Bayesian-optimal per-episode policies based on Beta-Bernoulli posteriors. The authors show that, for any fixed $m\ge 2$ and symmetric $f$, there exist problem-dependent constants such that, when $c_{\text{expl}}$ is sufficiently small, a learning-failure event $\mathtt{FAIL}_{c_{\mathcal{P}},N_{\mathcal{P}}}$ occurs with positive probability, and this implies linear Bayesian regret: $\mathsf{BReg}(T) \ge c_{0}(T-N_{0})$ for large $T$. They provide explicit utility-gap bounds for common aggregation functions (e.g., $f=\min$, $f=\max$) and extend results to the $m=2$ non-symmetric case, establishing that endogenous within-episode exploration cannot suffice to guarantee sublinear regret. The work highlights the need for external interventions or incentive schemes to sustain exploration in social-learning settings and connects Bayesian bandit theory with strategic exploration in episodic, self-interested environments.
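
The two structural conditions on the aggregation function $f$ (symmetry and monotonicity) can be checked by brute force for small $m$. The sketch below is only an illustration and not code from the paper: the helper names `is_symmetric`, `is_non_decreasing`, `AGGREGATORS`, and `check_all` are hypothetical, and the per-round outcomes are assumed to be encoded as tuples in $\{0,1\}^m$.

```python
from itertools import permutations, product

def is_symmetric(f, m):
    """True if f assigns the same score to every reordering of each outcome vector."""
    return all(
        f(x) == f(p)
        for x in product((0, 1), repeat=m)
        for p in permutations(x)
    )

def is_non_decreasing(f, m):
    """True if flipping any single outcome from 0 to 1 never lowers the score.

    Single-coordinate checks suffice: any coordinate-wise increase is a
    chain of such flips.
    """
    for x in product((0, 1), repeat=m):
        for i in range(m):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if f(y) < f(x):
                    return False
    return True

# Aggregation functions named in the summary: the sum, min, and max of outcomes.
AGGREGATORS = {"sum": sum, "min": min, "max": max}

def check_all(m):
    """Map each aggregator name to (is_symmetric, is_non_decreasing) for episode length m."""
    return {
        name: (is_symmetric(f, m), is_non_decreasing(f, m))
        for name, f in AGGREGATORS.items()
    }

if __name__ == "__main__":
    print(check_all(3))
```

A function that depends only on the first round, such as `lambda x: x[0]`, fails the symmetry check for $m=2$; this is the kind of aggregation covered by the non-symmetric extension mentioned above.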

Abstract

We study a stylized social learning dynamics where self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an "episode": a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that the aggregate exploration fails: e.g., its Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is some fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.

Paper Structure (31 sections, 23 theorems, 125 equations)

Introduction
Related Work
Model and Preliminaries
Learning Failures and Regret
Failure Results and Techniques
Proof Sketches
Proofs for \ref{sec:failure}: Failures and Regret
Proof of Lemma 3.4: from $\mathtt{FAIL}_{c,N}$ to strong failure
Proof of \ref{lm:util-gap}: utility-gap
Lower bounds on utility-gap in special cases: proof of \ref{lm:special_case_gap}
Proof of \ref{thm:m_2_symm}(a): $m=2$, Arbitrary Cost
Proof of \ref{lm:no_pull}
Case I.
Case II.
Case III.
...and 16 more sections

Key Result

Lemma 3.4

Fix $c\in(0,1/2)$ and $N\in\mathbb{N}$. Consider an $\mathtt{EpiBSL}$ instance such that $\delta:=\Pr\left[\mathtt{FAIL}_{c,N}\right]>0$. Then for some $N'\in\mathbb{N}$ determined by $c,N,\delta$ and the episode length $m$.

Theorems & Definitions (60)

Definition 3.1
Definition 3.2
Definition 3.3
Lemma 3.4
proof : Proof Sketch
Definition 3.5
Theorem 3.6
Remark 3.7
Lemma 3.8
proof : Proof Sketch
...and 50 more
