Nearly Tight Bounds for Exploration in Streaming Multi-armed Bandits with Known Optimality Gap

Nikolai Karpov, Chen Wang

TL;DR

This work analyzes pure exploration in streaming multi-armed bandits with prior knowledge of the optimality gap $Δ_{[2]}$, focusing on the trade-offs between passes, sample complexity, and memory. It proves a lower bound showing that any algorithm with slightly sublinear memory and near instance-optimal sample complexity must make at least $Ω\left( \log n/\log\log n \right)$ passes, and provides a nearly matching elimination-based upper bound achieving $O\left( \log n \cdot \sum_{i=2}^n 1/Δ_{[i]}^2 \right)$ arm pulls with a single-arm memory (given an appropriate number of passes). The results are extended to a parameterized family of algorithms that trades passes against sample complexity, and a variant handles unknown $Δ_{[2]}$ with only additive overhead. Experiments on uniform, arithmetic progression, and clustered instances support the theoretical findings, showing improved sample efficiency and fewer passes than the baselines. Overall, the paper nearly completes the theoretical picture for streaming MABs with known $Δ_{[2]}$ and indicates that near instance-optimal, memory-efficient exploration is practical in large-scale streaming settings.

Abstract

We investigate the sample-memory-pass trade-offs for pure exploration in multi-pass streaming multi-armed bandits (MABs) with *a priori* knowledge of the optimality gap $Δ_{[2]}$. Here, and throughout, the optimality gap $Δ_{[i]}$ is defined as the mean reward gap between the best and the $i$-th best arms. A recent line of results by Jin, Huang, Tang, and Xiao [ICML'21] and Assadi and Wang [COLT'24] has shown that when $Δ_{[2]}$ is not known, a pass complexity of $Θ(\log(1/Δ_{[2]}))$ (up to $\log\log(1/Δ_{[2]})$ terms) is necessary and sufficient to obtain the *worst-case optimal* sample complexity of $O(n/Δ^{2}_{[2]})$ with a single-arm memory. However, our understanding of multi-pass algorithms with known $Δ_{[2]}$ is still limited. Here, the key open problem is how many passes are required to achieve the instance-optimal sample complexity, i.e., $O( \sum_{i=2}^{n}1/Δ^2_{[i]})$ arm pulls, with a sublinear memory size. In this work, we show that the "right answer" to this question is $Θ(\log{n})$ passes (up to $\log\log{n}$ terms). We first present a lower bound, showing that any algorithm that finds the best arm with slightly sublinear memory -- a memory of $o({n}/{\text{polylog}({n})})$ arms -- and $O(\sum_{i=2}^{n}{1}/{Δ^{2}_{[i]}}\cdot \log{(n)})$ arm pulls has to make $Ω(\frac{\log{n}}{\log\log{n}})$ passes over the stream. We then give a nearly matching algorithm that, assuming knowledge of $Δ_{[2]}$, finds the best arm with $O( \sum_{i=2}^{n}1/Δ^2_{[i]} \cdot \log{n})$ arm pulls and a *single-arm* memory.
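To make the two complexity measures in the abstract concrete, the following is a minimal sketch that computes the gaps $Δ_{[i]}$, the instance-dependent quantity $\sum_{i=2}^{n} 1/Δ^2_{[i]}$, and the worst-case quantity $n/Δ^2_{[2]}$ for a given vector of mean rewards. The function names and the example instance are illustrative assumptions, not taken from the paper, and constants and logarithmic factors are ignored.

```python
from typing import List, Sequence


def optimality_gaps(means: Sequence[float]) -> List[float]:
    """Return [Δ_[2], ..., Δ_[n]]: the gap between the best arm and the i-th best arm."""
    ranked = sorted(means, reverse=True)
    best = ranked[0]
    return [best - m for m in ranked[1:]]


def instance_complexity(means: Sequence[float]) -> float:
    """Instance-dependent quantity: sum over i >= 2 of 1 / Δ_[i]^2."""
    gaps = optimality_gaps(means)
    if any(g <= 0 for g in gaps):
        raise ValueError("the best arm must be unique (all gaps must be positive)")
    return sum(1.0 / g ** 2 for g in gaps)


def worst_case_complexity(means: Sequence[float]) -> float:
    """Worst-case quantity: n / Δ_[2]^2, driven only by the smallest gap."""
    gaps = optimality_gaps(means)
    if gaps[0] <= 0:
        raise ValueError("the best arm must be unique (Δ_[2] must be positive)")
    return len(means) / gaps[0] ** 2


# Hypothetical instance: one close competitor and two clearly worse arms.
example_means = [0.9, 0.8, 0.5, 0.5]
h_instance = instance_complexity(example_means)   # about 100 + 6.25 + 6.25
h_worst = worst_case_complexity(example_means)    # about 4 / 0.1^2
```

Because every $Δ_{[i]}$ is at least $Δ_{[2]}$, the instance-dependent sum never exceeds $n/Δ^2_{[2]}$; the two coincide up to a constant only when all suboptimal arms sit at gap roughly $Δ_{[2]}$, and the difference grows as more arms fall far below the best one.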

Paper Structure

This paper contains 33 sections, 20 theorems, 88 equations, 4 figures, 4 tables, 4 algorithms.

Key Result

Lemma 3.1

Consider two arms with a Bernoulli reward distribution whose mean is parameterized as follows: … where $\rho\in (0,\frac{1}{2}]$ is the probability for the reward to be more than $\frac{1}{2}$, and $\alpha, \beta >0$ satisfy $\alpha+\beta<\frac{1}{2}$. Any algorithm to determine the reward of the arms with a success probability of at least $(1-\rho+\varepsilon)$ has to use $\frac{1}{4}\cdot$ …

Figures (4)

  • Figure 1: An illustration of the general $(B+1)$-batched instance distribution and the $\mathcal{P}(B, C, \gamma)$ instance distribution. For illustration purposes, the arms are shown with mean rewards in decreasing order from left to right; their actual positions inside the batches are chosen uniformly at random.
  • Figure 2: Comparison of the algorithms in terms of sample complexity and number of passes in the uniform setting. Sample counts are shown on a $\log_{10}(\cdot)$ scale for readability. Results are reported over $30$ independent runs. AW stands for the single-pass algorithm of Assadi and Wang [AssadiW20], and JHTX stands for the single-pass algorithm of Jin, Huang, Tang, and Xiao [JinH0X21].
  • Figure 3: Comparison of the algorithms in terms of sample complexity and number of passes in the arithmetic progression setting. Sample counts are shown on a $\log_{10}(\cdot)$ scale for readability. Results are reported over $30$ independent runs. AW stands for the single-pass algorithm of Assadi and Wang [AssadiW20], and JHTX stands for the single-pass algorithm of Jin, Huang, Tang, and Xiao [JinH0X21].
  • Figure 4: Comparison of the algorithms in terms of sample complexity and number of passes in the arithmetic progression setting. Sample counts are shown on a $\log_{10}(\cdot)$ scale for readability. Results are reported over $30$ independent runs. AW stands for the single-pass algorithm of Assadi and Wang [AssadiW20], and JHTX stands for the single-pass algorithm of Jin, Huang, Tang, and Xiao [JinH0X21].

Theorems & Definitions (52)

  • Remark 2.1
  • Lemma 3.1
  • proof
  • Claim 3.2
  • proof
  • Lemma 3.3
  • proof
  • Claim 3.4
  • proof
  • Lemma 3.5
  • ...and 42 more