Bandit Social Learning with Exploration Episodes
Kiarash Banihashem, Natalie Collina, Aleksandrs Slivkins
TL;DR
The paper analyzes Bandit Social Learning with Exploration Episodes (EpiBSL), a model in which self-interested agents control an episode of length $m\ge 2$ within a two-arm Bayesian bandit with a skip option and a per-round exploration cost $c_{\text{expl}}$. Each episode yields a score via a symmetric, non-decreasing aggregation function $f:\{0,1\}^m\to[0,m]$, and agents select Bayesian-optimal per-episode policies based on Beta-Bernoulli posteriors. The authors show that, for any fixed $m\ge 2$ and symmetric $f$, there exist problem-dependent constants such that, when $c_{\text{expl}}$ is sufficiently small, a learning-failure event $\mathtt{FAIL}_{c_{\mathcal{P}},N_{\mathcal{P}}}$ occurs with positive probability, and this implies linear Bayesian regret: $\mathsf{BReg}(T) \ge c_{0}(T-N_{0})$ for large $T$. They provide explicit utility-gap bounds for common aggregation functions (e.g., $f=\min$, $f=\max$) and extend results to the $m=2$ non-symmetric case, establishing that endogenous within-episode exploration cannot suffice to guarantee sublinear regret. The work highlights the need for external interventions or incentive schemes to sustain exploration in social-learning settings and connects Bayesian bandit theory with strategic exploration in episodic, self-interested environments.
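The two ingredients named above, Beta-Bernoulli posteriors and a symmetric, non-decreasing aggregation function $f$, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`beta_update`, `posterior_mean`, `episode_score`) and the uniform Beta(1, 1) prior are assumptions made for illustration. The skip option, the exploration cost $c_{\text{expl}}$, and the Bayesian-optimal per-episode policy are not modeled.

```python
from typing import Callable, Dict, Sequence, Tuple


def beta_update(alpha: float, beta: float, reward: int) -> Tuple[float, float]:
    """Return Beta posterior parameters after one Bernoulli reward (0 or 1)."""
    return alpha + reward, beta + (1 - reward)


def posterior_mean(alpha: float, beta: float) -> float:
    """Posterior mean reward of an arm with a Beta(alpha, beta) posterior."""
    return alpha / (alpha + beta)


def episode_score(rewards: Sequence[int],
                  f: Callable[[Sequence[int]], float]) -> float:
    """Aggregate the per-round 0/1 outcomes of one episode into a score."""
    return f(rewards)


# Symmetric, non-decreasing aggregators mapping {0,1}^m into [0, m].
AGGREGATORS: Dict[str, Callable[[Sequence[int]], float]] = {
    "sum": sum,
    "min": min,
    "max": max,
}

# One hypothetical episode of length m = 3 on a single arm.
episode_rewards = [1, 0, 1]

alpha, beta = 1.0, 1.0  # assumed uniform prior, not taken from the paper
for r in episode_rewards:
    alpha, beta = beta_update(alpha, beta, r)

scores = {name: episode_score(episode_rewards, f)
          for name, f in AGGREGATORS.items()}
```

In this hypothetical episode the same outcomes give different scores under `sum`, `min`, and `max`, which is why the choice of $f$ affects how much an agent is willing to explore within an episode.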
Abstract
We study stylized social learning dynamics in which self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an ``episode'': a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that the aggregate exploration fails: e.g., the Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is some fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.
