The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits

Sepehr Assadi, Chen Wang

TL;DR

The paper studies pure exploration for multi-armed bandits in a streaming setting with sublinear memory and establishes a near-optimal lower bound on the number of passes required to achieve the classical $O\left(\frac{n}{\Delta^2}\right)$ sample complexity. By constructing a reverse-ordered multi-batch instance distribution and introducing notions of memory-obliviousness and batch-obliviousness, the authors develop an inductive, per-pass information-tracking framework that yields an $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ lower bound on passes, matching known upper bounds up to a doubly-logarithmic factor. The work presents two core auxiliary lemmas (arm-trapping variants) and a two-case strategy (Conservative vs. Radical) to bound learning under tight sample budgets, culminating in a proof of a near-optimal sample-pass trade-off for streaming MABs. These results resolve an open question about whether multiple passes are necessary under sublinear memory and show how memory constraints limit pure exploration in large-scale bandit problems.

Abstract

We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O\left(\frac{n}{\Delta^2}\right)$ requires $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O\left(\log\left(\frac{1}{\Delta}\right)\right)$-pass algorithm of Jin et al. [ICML'21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC'20].
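
To get a feel for how close the two pass bounds in the abstract are, the following is a minimal numerical sketch. It assumes all hidden constants equal 1 and uses base-2 logarithms; both choices are illustrative assumptions, since the asymptotic statements fix neither. The function name `pass_bounds` is hypothetical.

```python
import math


def pass_bounds(delta: float) -> tuple[float, float]:
    """Return (lower, upper) pass counts with hidden constants set to 1.

    Assumptions for illustration only: base-2 logarithms and unit constants.
    """
    if not 0.0 < delta < 0.5:
        # Needed so that log(1/delta) > 1 and log log(1/delta) > 0.
        raise ValueError("delta must lie in (0, 1/2)")
    log_inv = math.log2(1.0 / delta)      # log(1/Delta)
    lower = log_inv / math.log2(log_inv)  # log(1/Delta) / log log(1/Delta)
    upper = log_inv                       # log(1/Delta)
    return lower, upper


for exponent in (4, 16, 256):
    lower, upper = pass_bounds(2.0 ** -exponent)
    print(f"Delta = 2^-{exponent}: lower ~ {lower:.1f}, upper ~ {upper:.1f}")
```

Under these assumptions, $\Delta = 2^{-16}$ gives 4 for the lower-bound expression and 16 for the upper-bound expression; the ratio between the two is exactly the $\log\log(1/\Delta)$ factor.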

Paper Structure

This paper contains 32 sections, 13 theorems, 109 equations, and 1 figure.

Key Result

Lemma 2.1

Consider an arm with a Bernoulli distribution whose mean is parameterized as follows. where $\rho\in (0,\frac{1}{2}]$ is a fixed parameter. For $\beta\in (0, \frac{1}{6})$, any algorithm that determines the reward of the arm with success probability at least $1-\rho+\varepsilon$ must use at least $\frac{1}{4}\cdot \frac{\varepsilon^2}{\rho^2 \beta^{2}}$ arm pulls.
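
As a quick way to read the bound in Lemma 2.1, the sketch below simply evaluates $\frac{1}{4}\cdot\frac{\varepsilon^2}{\rho^2\beta^2}$ for given parameters. The function name `lemma_2_1_pull_bound` is hypothetical, and the range check $0 < \varepsilon \le \rho$ is an assumption made here (it keeps the target success probability $1-\rho+\varepsilon$ at most 1); the ranges for $\rho$ and $\beta$ are those stated in the lemma.

```python
def lemma_2_1_pull_bound(eps: float, rho: float, beta: float) -> float:
    """Evaluate the quantity (1/4) * eps^2 / (rho^2 * beta^2) from Lemma 2.1."""
    if not 0.0 < rho <= 0.5:
        raise ValueError("rho must lie in (0, 1/2]")
    if not 0.0 < beta < 1.0 / 6.0:
        raise ValueError("beta must lie in (0, 1/6)")
    if not 0.0 < eps <= rho:
        # Assumption of this sketch: keeps 1 - rho + eps within (1 - rho, 1].
        raise ValueError("eps must lie in (0, rho]")
    return 0.25 * eps**2 / (rho**2 * beta**2)


# Halving beta quadruples the bound, since it scales as 1 / beta^2.
print(lemma_2_1_pull_bound(eps=0.25, rho=0.5, beta=0.125))
print(lemma_2_1_pull_bound(eps=0.25, rho=0.5, beta=0.0625))
```

The expression scales as $1/\beta^2$, so the number of pulls grows quadratically as $\beta$ shrinks, and it scales as $\varepsilon^2/\rho^2$, so a larger advantage $\varepsilon$ relative to $\rho$ costs more pulls.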

Figures (1)

  • Figure 1: An illustration of $\mathcal{D}(P,C)$. The batch indices are arranged in the reverse of the order in which the batches arrive in the stream. Batch $B_{P+1}$ always has an arm with mean reward $1/2+\eta_{P+1}$, while every other batch $B_p$ has a special arm with mean reward $1/2+\eta_{p}$ with probability $\frac{1}{2P}$.
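
The batch structure described in the Figure 1 caption can be sketched as a small sampler. This is an illustration under stated assumptions, not the paper's construction: it assumes the special-arm coin flips are independent across batches, reads "reverse order" as $B_{P+1}$ arriving first and $B_1$ last, and leaves out batch sizes, the values $\eta_p$, and the parameter $C$, none of which the caption specifies. The names `sample_special_batches` and `arrival_order` are hypothetical.

```python
import random


def sample_special_batches(P: int, rng: random.Random) -> set[int]:
    """Return the indices p of batches B_p that contain a special arm."""
    special = {P + 1}  # B_{P+1} always holds an arm with mean 1/2 + eta_{P+1}
    for p in range(1, P + 1):
        # Each other batch holds a special arm with probability 1/(2P).
        # Independence across batches is an assumption of this sketch.
        if rng.random() < 1.0 / (2 * P):
            special.add(p)
    return special


def arrival_order(P: int) -> list[int]:
    """Batch indices in stream-arrival order (assumed: B_{P+1} first, B_1 last)."""
    return list(range(P + 1, 0, -1))


rng = random.Random(0)
print(arrival_order(4))
print(sorted(sample_special_batches(4, rng)))
```

Under the independence assumption, the expected number of batches holding a special arm is $1 + P\cdot\frac{1}{2P} = \frac{3}{2}$, regardless of $P$.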

Theorems & Definitions (34)

  • Lemma 2.1
  • Lemma 2.2
  • proof
  • Theorem 1
  • Corollary 3.2: Formalization of the main result
  • proof
  • Remark 3.3
  • Lemma 4.1: low-probability arm-trapping lemma
  • proof
  • Lemma 4.2: A sample-knowledge trade-off lemma
  • ...and 24 more