Nearly Tight Bounds for Exploration in Streaming Multi-armed Bandits with Known Optimality Gap

Nikolai Karpov; Chen Wang

TL;DR

This work analyzes pure exploration in streaming multi-armed bandits with prior knowledge of the optimality gap $Δ_{[2]}$, focusing on the trade-offs between passes, sample complexity, and memory. It proves a lower bound showing that any algorithm with slightly sublinear memory and near instance-optimal sample complexity must make $Ω\left( \log n/\log\log n \right)$ passes, and provides a nearly matching elimination-based upper bound achieving $O\left( \log n \cdot \sum_{i=2}^n 1/Δ_{[i]}^2 \right)$ arm pulls with a single-arm memory (for an appropriately chosen number of passes). The results are extended to a parameterized family of algorithms that trades passes against sample complexity, and a variant handles unknown $Δ_{[2]}$ with only additive overhead. Experiments on uniform, arithmetic-progression, and clustered instances support the theoretical findings, showing improved sample efficiency and fewer passes than strong baselines. Overall, the paper nearly completes the theoretical picture for streaming MABs with known $Δ_{[2]}$ and highlights the practical viability of near instance-optimal, memory-efficient exploration in large-scale streaming settings.

Abstract

We investigate the sample-memory-pass trade-offs for pure exploration in multi-pass streaming multi-armed bandits (MABs) with *a priori* knowledge of the optimality gap $Δ_{[2]}$. Here, and throughout, the optimality gap $Δ_{[i]}$ is defined as the gap in mean reward between the best arm and the $i$-th best arm. A recent line of results by Jin, Huang, Tang, and Xiao [ICML'21] and Assadi and Wang [COLT'24] has shown that when $Δ_{[2]}$ is not known, a pass complexity of $Θ(\log(1/Δ_{[2]}))$ (up to $\log\log(1/Δ_{[2]})$ terms) is necessary and sufficient to obtain the *worst-case optimal* sample complexity of $O(n/Δ^{2}_{[2]})$ with a single-arm memory. However, our understanding of multi-pass algorithms with known $Δ_{[2]}$ is still limited. The key open problem is how many passes are required to achieve the instance-dependent sample complexity, i.e., $O( \sum_{i=2}^{n}1/Δ^2_{[i]})$ arm pulls, with a sublinear memory size. In this work, we show that the "right answer" to this question is $Θ(\log{n})$ passes (up to $\log\log{n}$ terms). We first present a lower bound showing that any algorithm that finds the best arm with slightly sublinear memory (a memory of $o({n}/{\text{polylog}({n})})$ arms) and $O(\sum_{i=2}^{n}{1}/{Δ^{2}_{[i]}}\cdot \log{(n)})$ arm pulls has to make $Ω(\frac{\log{n}}{\log\log{n}})$ passes over the stream. We then give a nearly matching algorithm that, assuming knowledge of $Δ_{[2]}$, finds the best arm with $O( \sum_{i=2}^{n}1/Δ^2_{[i]} \cdot \log{n})$ arm pulls and a *single-arm* memory.
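To make the two sample-complexity expressions in the abstract concrete, the following minimal Python sketch evaluates $\sum_{i=2}^{n} 1/Δ^2_{[i]}$ and $n/Δ^2_{[2]}$ on a toy instance. It is an illustration only, not the paper's algorithm: the function names and the example means are hypothetical, and constants and logarithmic factors hidden by the $O(\cdot)$ notation are ignored.

```python
from typing import List, Sequence


def optimality_gaps(means: Sequence[float]) -> List[float]:
    """Return [gap_[2], ..., gap_[n]]: best mean minus the i-th best mean."""
    ordered = sorted(means, reverse=True)
    best = ordered[0]
    return [best - m for m in ordered[1:]]


def instance_dependent_pulls(means: Sequence[float]) -> float:
    """Sum over i >= 2 of 1 / gap_[i]^2 (constants and log factors dropped)."""
    gaps = optimality_gaps(means)
    if not gaps or gaps[0] <= 0:
        raise ValueError("need at least two arms and a unique best arm")
    return sum(1.0 / g ** 2 for g in gaps)


def worst_case_pulls(means: Sequence[float]) -> float:
    """n / gap_[2]^2 (constants dropped)."""
    gaps = optimality_gaps(means)
    if not gaps or gaps[0] <= 0:
        raise ValueError("need at least two arms and a unique best arm")
    return len(means) / gaps[0] ** 2


# Hypothetical instance: one best arm, one close competitor, 98 poor arms.
example_means = [1.0, 0.5] + [0.0] * 98
print(instance_dependent_pulls(example_means))  # 4 + 98 * 1 = 102.0
print(worst_case_pulls(example_means))          # 100 / 0.25 = 400.0
```

On this instance only one arm is close to the best, so the instance-dependent quantity is roughly a quarter of the worst-case one; the difference grows as more arms sit far from the best arm.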
