Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed Bandits

Yuchen He, Zichun Ye, Chihao Zhang

TL;DR

This paper analyzes stochastic multi-armed bandits in a memory-limited streaming model with $P$ passes and derives a tight memory-regret trade-off in terms of the memory size $m$, the number of arms $n$, and $P$. It introduces a memory-aware, multi-pass exploration framework built on best-arm retention (BAR) and best-arm identification (BAI), using an online stochastic mirror descent (OSMD) subroutine and two FindBest calls per pass to control exploration. The authors prove an upper bound of $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ and a matching lower bound (up to logarithmic factors) for sufficiently large $T$, covering both the large-memory and small-memory regimes. The result closes the gap on how memory size and the number of passes jointly determine regret, which informs the design of streaming policies for online decision-making when memory is constrained but multiple passes over the arms are feasible.

Abstract

We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m$, $n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
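To see how the bound changes with the number of passes, the three exponents can be evaluated exactly. The following is a minimal sketch that reads the exponents of $(n-m)$, $n$ and $T$ directly off the formula in the abstract; the helper name `regret_exponents` is a hypothetical choice, and the logarithmic factors hidden by $\tilde O$ are ignored.

```python
from fractions import Fraction


def regret_exponents(P: int):
    """Exponents (a, b, c) such that the bound reads (n-m)^a * n^b * T^c.

    Taken term by term from the formula in the abstract; log factors
    hidden by the tilde notation are not modeled.
    """
    if P < 1:
        raise ValueError("the number of passes P must be at least 1")
    denom = 2 ** (P + 1) - 1
    a = 1 + Fraction(2 ** P - 2, denom)      # exponent of (n - m)
    b = Fraction(2 - 2 ** (P + 1), denom)    # exponent of n
    c = Fraction(2 ** P, denom)              # exponent of T
    return a, b, c


if __name__ == "__main__":
    for P in (1, 2, 3):
        a, b, c = regret_exponents(P)
        print(f"P={P}: (n-m)^({a}) * n^({b}) * T^({c})")
```

For $P=1$ the bound specializes to $(n-m)\,n^{-2/3}\,T^{2/3}$. The exponent of $T$ is $\frac{2^P}{2^{P+1}-1}$, which decreases toward $\frac{1}{2}$ as $P$ grows, and the three exponents sum to 1 for every $P$.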

Paper Structure

This paper contains 64 sections, 31 theorems, 74 equations, 1 figure, 1 table, 7 algorithms.

Key Result

Theorem 1

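Given a stream with $n$ arms, assuming $T\geq (n+1)^2$, for any number of passes $1\leq P\leq \log\log T - \log\left(12\log\frac{n}{n-m}\right)$ and memory size $2\leq m< n$, there exists a $P$-pass algorithm using a memory of $m$ arms with regret $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$, the upper bound stated in the abstract.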
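The side conditions of Theorem 1 can be checked numerically for a concrete instance. The sketch below computes the largest admissible number of passes for given $n$, $m$ and $T$. The function name `max_passes` and the `log_base` parameter are illustrative assumptions: the statement above does not fix the base of the logarithm, so base 2 is assumed by default and can be changed.

```python
import math


def max_passes(n: int, m: int, T: int, log_base: float = 2.0) -> int:
    """Largest integer P with 1 <= P <= loglog T - log(12 log(n/(n-m))).

    Returns 0 when no pass count satisfies the condition.
    ASSUMPTION: the logarithm base is not specified in the summary;
    base 2 is used unless log_base says otherwise.
    """
    if not 2 <= m < n:
        raise ValueError("Theorem 1 requires 2 <= m < n")
    if T < (n + 1) ** 2:
        raise ValueError("Theorem 1 requires T >= (n + 1)^2")

    def log(x: float) -> float:
        return math.log(x, log_base)

    # n / (n - m) > 1 because m >= 2, so the inner logarithm is positive.
    upper = log(log(T)) - log(12 * log(n / (n - m)))
    return max(0, math.floor(upper))


if __name__ == "__main__":
    print(max_passes(n=100, m=2, T=2 ** 64))
    print(max_passes(n=100, m=99, T=2 ** 64))
```

Under the base-2 assumption, the admissible range shrinks as $m$ approaches $n$, because $\log\frac{n}{n-m}$ grows and is subtracted from $\log\log T$.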

Figures (1)

  • Figure 1: Each cell in the diagram represents one round. Inward arrows denote reading arms from the stream, and outward arrows denote dropping arms from the memory. A separate symbol marks the round in which the algorithm reads the last arm of a pass, signifying the end of that pass. Any multi-pass streaming algorithm can be formalized as depicted in this figure: it decomposes into an exploration phase and an exploitation phase, with each pass in the exploration phase consuming $L_p$ rounds for some (possibly random) $L_p$.
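The decomposition in the caption can be made concrete with a toy simulation. The sketch below is not the paper's algorithm: it is a naive champion-versus-challenger explore-then-commit scheme with a memory of two arms and Bernoulli rewards, written only to show the bookkeeping of per-pass exploration rounds $L_p$ followed by an exploitation phase. All names (`run_streaming_etc`, `pulls_per_arm`, and so on) are hypothetical.

```python
import random


def run_streaming_etc(means, passes, pulls_per_arm, horizon, seed=0):
    """Naive P-pass explore-then-commit with a two-arm memory.

    ASSUMPTION: illustrative only, not the algorithm analyzed in the paper.
    Returns (rounds_per_pass, exploit_rounds, champion, total_reward).
    """
    if not means or passes < 1 or pulls_per_arm < 1 or horizon < 1:
        raise ValueError("need at least one arm, pass, pull and round")
    rng = random.Random(seed)

    def pull(arm):
        # Bernoulli reward with success probability means[arm].
        return 1.0 if rng.random() < means[arm] else 0.0

    rounds_per_pass = []                  # L_p for each exploration pass
    champion, champion_mean = None, -1.0  # the single arm kept in memory
    total_reward, used = 0.0, 0

    for _ in range(passes):
        pass_rounds = 0
        for arm in range(len(means)):     # arms arrive in stream order
            budget = min(pulls_per_arm, horizon - used)
            if budget == 0:
                break                     # horizon exhausted during exploration
            reward = sum(pull(arm) for _ in range(budget))
            used += budget
            pass_rounds += budget
            total_reward += reward
            if reward / budget > champion_mean:
                # Keep the challenger and drop the old champion from memory.
                champion, champion_mean = arm, reward / budget
        rounds_per_pass.append(pass_rounds)

    exploit_rounds = horizon - used       # exploitation phase
    total_reward += sum(pull(champion) for _ in range(exploit_rounds))
    return rounds_per_pass, exploit_rounds, champion, total_reward
```

By construction, the entries of `rounds_per_pass` plus `exploit_rounds` add up to `horizon`, mirroring the split into exploration passes and a final exploitation phase.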

Theorems & Definitions (52)

  • Theorem 1
  • Theorem 2
  • Proposition 3: LG21, Theorem 11
  • Lemma 4
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Theorem 7: regret bound for the algorithm labeled algo:large-m-simple
  • proof
  • ...and 42 more