The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits
Sepehr Assadi, Chen Wang
TL;DR
The paper studies pure exploration for multi-armed bandits in a streaming setting with sublinear memory and establishes a near-optimal lower bound on the number of passes required to achieve the classical $O\left(\frac{n}{\Delta^2}\right)$ sample complexity. By constructing a reverse-ordered multi-batch instance distribution and introducing notions of memory-obliviousness and batch-obliviousness, the authors develop an inductive, per-pass information-tracking framework that yields an $\Omega\left(\frac{\log(1/\Delta)}{\log\log(1/\Delta)}\right)$ lower bound on passes, matching known upper bounds up to a doubly-logarithmic factor. The work presents two core auxiliary lemmas (arm-trapping variants) and a two-case analysis (Conservative vs. Radical) to bound what an algorithm can learn under tight sample budgets, culminating in a proof of a near-optimal sample-pass trade-off for streaming MABs with sublinear memory. These results resolve an open question about the necessity of multiple passes under sublinear memory and show how memory constraints shape pure-exploration capabilities in large-scale bandit problems.
Abstract
We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(\frac{n}{\Delta^2})$ requires $\Omega(\frac{\log{(1/\Delta)}}{\log\log{(1/\Delta)}})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(\frac{1}{\Delta}))$-pass algorithm of Jin et al. [ICML'21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC'20].
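To get a feel for how close the two pass bounds are, the following is a minimal numerical sketch, not anything taken from the paper. It assumes every hidden constant equals 1 and uses base-2 logarithms; the asymptotic statements fix neither, so the outputs are illustrative only. The function name `pass_bounds` is hypothetical.

```python
import math


def pass_bounds(delta: float) -> tuple[float, float]:
    """Return (lower, upper) pass counts for reward gap `delta`.

    Assumptions (not from the paper): all hidden constants are 1 and
    logarithms are base 2.
    """
    # Require delta < 1/4 so that log2(1/delta) > 2 and log2(log2(1/delta)) > 1,
    # which keeps the lower-bound expression positive and below the upper bound.
    if not 0.0 < delta < 0.25:
        raise ValueError("delta must lie in (0, 1/4)")
    log_inv = math.log2(1.0 / delta)
    lower = log_inv / math.log2(log_inv)  # log(1/delta) / log log(1/delta)
    upper = log_inv                       # log(1/delta)
    return lower, upper


if __name__ == "__main__":
    for exponent in (4, 16, 64):
        lo, hi = pass_bounds(2.0 ** -exponent)
        print(f"delta = 2^-{exponent}: lower ~ {lo:.2f}, upper ~ {hi:.2f}")
```

Under these assumptions, $\Delta = 2^{-16}$ gives about 4 passes for the lower bound against 16 for the upper bound; the ratio between the two is exactly the $\log\log(1/\Delta)$ factor that separates them.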
