The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits

Sepehr Assadi; Chen Wang


TL;DR

The paper studies pure exploration for multi-armed bandits in a streaming setting with sublinear memory and establishes a near-optimal lower bound on the number of passes required to achieve the classical $O\left(\frac{n}{\Delta^2}\right)$ sample complexity. By constructing a reverse-ordered multi-batch instance distribution and introducing notions of memory-obliviousness and batch-obliviousness, the authors develop an inductive, per-pass information-tracking framework that yields an $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ lower bound on passes, matching known upper bounds up to a doubly-logarithmic factor. The work presents two core auxiliary lemmas (arm-trapping variants) and a two-case strategy (Conservative vs. Radical) to bound learning under tight sample budgets, culminating in a proof of a near-optimal sample-pass trade-off for streaming MABs with sublinear memory. These results resolve an open question about the necessity of multiple passes under sublinear memory and show how memory constraints limit pure exploration in large-scale bandit problems.

Abstract

We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O\left(\frac{n}{\Delta^2}\right)$ requires $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O\left(\log\left(\frac{1}{\Delta}\right)\right)$-pass algorithm of Jin et al. [ICML'21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC'20].
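To get a feel for how close the two pass bounds in the abstract are, the following minimal sketch evaluates both expressions for a given gap $\Delta$. It is illustrative only: the function names are hypothetical, the natural logarithm is an arbitrary choice, and the constants hidden by $\Omega(\cdot)$ and $O(\cdot)$ are set to 1, so the numbers indicate growth rates rather than actual pass counts.

```python
import math


def pass_lower_bound(delta: float) -> float:
    """Evaluate log(1/delta) / log(log(1/delta)) with the hidden constant taken as 1.

    Assumption: natural logarithms. The expression is only positive and
    well defined when log(1/delta) > 1, i.e. delta < 1/e.
    """
    if not 0.0 < delta < 1.0 / math.e:
        raise ValueError("delta must lie in (0, 1/e) for the expression to be positive")
    x = math.log(1.0 / delta)
    return x / math.log(x)


def pass_upper_bound(delta: float) -> float:
    """Evaluate log(1/delta) with the hidden constant taken as 1."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.log(1.0 / delta)


if __name__ == "__main__":
    for delta in (1e-2, 1e-4, 1e-8):
        lo = pass_lower_bound(delta)
        hi = pass_upper_bound(delta)
        # The ratio hi / lo equals log(log(1/delta)), the doubly-logarithmic gap.
        print(f"delta={delta:g}: lower~{lo:.2f}, upper~{hi:.2f}, ratio~{hi / lo:.2f}")
```

The ratio between the two expressions is exactly $\log\log(1/\Delta)$, which is the lower-order term by which the lower bound and the upper bound differ.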


Paper Structure (32 sections, 13 theorems, 109 equations, 1 figure)


Introduction
Our Techniques
Related Work
Preliminaries
Notation.
The Multi-pass Streaming MABs Model
Randomized algorithms.
Offline algorithms.
Standard Sample Complexity Lower Bounds for Single-armed Bandit
Main Result
Auxiliary Lemmas for Pure Exploration in MABs
Case A): the true reward of $\widetilde{\textnormal{arm}}$ is $1/2$.
Case B): the true reward of $\widetilde{\textnormal{arm}}$ is $1/2+\beta$.
The Multi-Pass Lower Bound: Proof of Theorem 1
Additional Notation.
...and 17 more sections

Key Result

Lemma 2.1

Consider an arm with a Bernoulli distribution whose mean is parameterized as follows. where $\rho\in (0,\frac{1}{2}]$ is a fixed parameter. Any algorithm that determines the reward of the arm for $\beta\in (0, \frac{1}{6})$ with success probability at least $1-\rho+\varepsilon$ must use at least $\frac{1}{4}\cdot \frac{\varepsilon^2}{\rho^2 \beta^{2}}$ arm pulls.
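As a quick numerical reading of the bound in Lemma 2.1, the sketch below evaluates $\frac{1}{4}\cdot\frac{\varepsilon^2}{\rho^2\beta^2}$ for given parameters. The function name is hypothetical. The ranges for $\rho$ and $\beta$ are taken from the lemma statement; the extra check $0 < \varepsilon \le \rho$ is an assumption added here so that the target success probability $1-\rho+\varepsilon$ does not exceed 1.

```python
def lemma_2_1_pull_bound(rho: float, beta: float, eps: float) -> float:
    """Evaluate (1/4) * eps^2 / (rho^2 * beta^2), the arm-pull bound of Lemma 2.1.

    rho  in (0, 1/2]  and  beta in (0, 1/6)  follow the lemma statement.
    0 < eps <= rho is an added sanity check (assumption), keeping the
    success probability 1 - rho + eps at most 1.
    """
    if not 0.0 < rho <= 0.5:
        raise ValueError("rho must lie in (0, 1/2]")
    if not 0.0 < beta < 1.0 / 6.0:
        raise ValueError("beta must lie in (0, 1/6)")
    if not 0.0 < eps <= rho:
        raise ValueError("eps must lie in (0, rho]")
    return 0.25 * eps ** 2 / (rho ** 2 * beta ** 2)


if __name__ == "__main__":
    # Halving beta quadruples the bound; halving eps divides it by four.
    print(lemma_2_1_pull_bound(rho=0.5, beta=0.1, eps=0.25))
    print(lemma_2_1_pull_bound(rho=0.5, beta=0.05, eps=0.25))
```

When $\varepsilon = \rho$ the expression reduces to $\frac{1}{4\beta^2}$, so the bound scales as $1/\beta^2$ in the gap; smaller advantages $\varepsilon$ over the baseline $1-\rho$ shrink it by the factor $\varepsilon^2/\rho^2$.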

Figures (1)

Figure 1: An illustration of $\mathcal{D}(P,C)$. The indices of batches are arranged in the reversed order of the arrival of the stream. Batch $B_{P+1}$ always has an arm with mean reward $1/2+\eta_{P+1}$, while every other batch $B_p$ has a special arm with mean reward $1/2+\eta_{p}$ with probability $\frac{1}{2P}$.
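The batch structure described in the caption can be sketched as a small sampler. This is only an illustration of the caption, not the paper's construction: the function names and the `etas` values are hypothetical, the role of $C$ and the batch sizes are omitted because the caption does not specify them, and the special-arm coin flips are assumed to be independent across batches. The arrival order reflects one reading of "reversed order", namely that $B_{P+1}$ arrives first and $B_1$ last.

```python
import random
from typing import Dict, List, Optional


def sample_special_arms(
    P: int, etas: Dict[int, float], rng: Optional[random.Random] = None
) -> Dict[int, Optional[float]]:
    """Sample which batches contain a special arm, following the Figure 1 caption.

    Returns a dict mapping batch index p (1..P+1) to the special arm's mean
    reward 1/2 + eta_p, or None if batch p has no special arm.
    Assumption: batches 1..P are decided independently, each w.p. 1/(2P).
    """
    rng = rng or random.Random()
    special: Dict[int, Optional[float]] = {}
    for p in range(1, P + 1):
        has_special = rng.random() < 1.0 / (2 * P)
        special[p] = 0.5 + etas[p] if has_special else None
    # Batch B_{P+1} always contains an arm with mean 1/2 + eta_{P+1}.
    special[P + 1] = 0.5 + etas[P + 1]
    return special


def arrival_order(P: int) -> List[int]:
    """Batch indices in stream-arrival order (assumed: B_{P+1} first, B_1 last)."""
    return list(range(P + 1, 0, -1))


if __name__ == "__main__":
    P = 4
    # Hypothetical gap values, chosen only so the example runs.
    etas = {p: 0.1 / p for p in range(1, P + 2)}
    special = sample_special_arms(P, etas, random.Random(0))
    for p in arrival_order(P):
        print(f"B_{p}: special arm mean = {special[p]}")
```

Under the independence assumption, the expected number of batches among $B_1,\dots,B_P$ that contain a special arm is $P \cdot \frac{1}{2P} = \frac{1}{2}$, so in a typical sample the arm in $B_{P+1}$ is the only special arm.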

Theorems & Definitions (34)

Lemma 2.1
Lemma 2.2
proof
Theorem 1
Corollary 3.2: Formalization of the main result
proof
Remark 3.3
Lemma 4.1: low-probability arm-trapping lemma
proof
Lemma 4.2: A sample-knowledge trade-off lemma
...and 24 more
