Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed Bandits

Yuchen He; Zichun Ye; Chihao Zhang

TL;DR

This paper analyzes stochastic multi-armed bandits in a memory-limited streaming model with $P$ passes and derives a tight memory-regret trade-off in terms of the memory size $m$, the number of arms $n$, and $P$. It introduces a memory-aware, multi-pass exploration framework built on best-arm retention (BAR) and best-arm identification (BAI), employing an online stochastic mirror descent (OSMD) subroutine and two FindBest calls per pass to control exploration. The authors prove an upper bound of $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ and a matching lower bound (up to logarithmic factors) for sufficiently large $T$, covering both the large-memory and small-memory regimes. The result settles how memory size interacts with the number of passes to shape regret, which informs the design of streaming policies for online decision-making when only a few arms can be stored and the stream can be read only a limited number of times.

Abstract

We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m, n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
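To make the exponents in the stated bound easier to read, the following is a small sketch that evaluates them exactly for a given number of passes $P$. The function names `regret_exponents` and `regret_rate` are illustrative and do not come from the paper; the code only restates the polynomial part of the bound and ignores constants and logarithmic factors.

```python
from fractions import Fraction


def regret_exponents(P):
    """Exponents of (n - m), n and T in the stated bound for P passes."""
    if P < 1:
        raise ValueError("P must be at least 1")
    denom = 2 ** (P + 1) - 1
    exp_gap = 1 + Fraction(2 ** P - 2, denom)  # exponent of (n - m)
    exp_n = Fraction(2 - 2 ** (P + 1), denom)  # exponent of n
    exp_T = Fraction(2 ** P, denom)            # exponent of T
    return exp_gap, exp_n, exp_T


def regret_rate(n, m, T, P):
    """Polynomial part of the bound only (no constants, no log factors)."""
    exp_gap, exp_n, exp_T = regret_exponents(P)
    return (n - m) ** float(exp_gap) * n ** float(exp_n) * T ** float(exp_T)


if __name__ == "__main__":
    for P in (1, 2, 3):
        print(P, regret_exponents(P))
```

For $P=1$ the exponents are $1$, $-2/3$ and $2/3$, so the polynomial part reads $(n-m)\,n^{-2/3}\,T^{2/3}$. As $P$ grows, the exponent of $T$ decreases toward $1/2$ and the exponent of $(n-m)$ increases toward $3/2$.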

Paper Structure (64 sections, 31 theorems, 74 equations, 1 figure, 1 table, 7 algorithms)

Introduction
Main results
Overview of our algorithms and techniques
Comparison with lower bounds in AKP22
Related work
Organization of the paper
Preliminaries
Multi-armed bandit in streaming model
Multi-armed bandit
Streaming MAB
The mechanism of streaming MAB
Streaming stochastic MAB
Exploration and exploitation phase
More remarks on the memory model
Best arm identification and best arm retention
...and 49 more sections

Key Result

Theorem 1

Given a stream with $n$ arms, assuming $T\geq (n+1)^2$, for arbitrary pass number $1\leq P\leq \log\log T - \log\left(12\log\frac{n}{n-m}\right)$ and memory size $2\leq m< n$, there exists a $P$-pass algorithm using a memory of $m$ arms with regret $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$.
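The admissible range of $P$ in Theorem 1 can be checked numerically. The sketch below computes the largest integer $P$ that satisfies the stated condition. The helper name `max_passes` is hypothetical, and the base of the logarithm is an assumption: the statement here does not specify it, so base 2 is used as a default and can be swapped via the `log` argument.

```python
import math


def max_passes(n, m, T, log=math.log2):
    """Largest integer P with P <= log log T - log(12 log(n / (n - m))).

    The logarithm base is an assumption (base 2 by default).
    A return value below 1 means no pass count satisfies the condition.
    """
    if not (2 <= m < n):
        raise ValueError("the theorem requires 2 <= m < n")
    if T < (n + 1) ** 2:
        raise ValueError("the theorem requires T >= (n + 1)^2")
    upper = log(log(T)) - log(12 * log(n / (n - m)))
    return math.floor(upper)


if __name__ == "__main__":
    # n = 100 arms, memory for m = 50 arms, T = 2^64 rounds.
    print(max_passes(100, 50, 2 ** 64))
```

With base-2 logarithms, $n=100$, $m=50$ and $T=2^{64}$ give $\log\log T = 6$ and $\log(12\log 2) = \log 12 \approx 3.58$, so the condition allows $P \in \{1, 2\}$.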

Figures (1)

Figure 1: Each cell in the diagram represents one round. Inward arrows denote reading arms from the stream, and outward arrows denote dropping arms from the memory. A dedicated symbol marks the round in which the algorithm reads the last arm of a pass, signifying the end of that pass. Any multi-pass streaming algorithm can be formalized as depicted in this figure: it decomposes into an exploration phase and an exploitation phase, with each pass in the exploration phase consuming $L_p$ rounds for some (possibly random) $L_p$.
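The mechanism in Figure 1 can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and is not the paper's algorithm: the eviction rule (drop the stored arm with the lowest empirical mean), the fixed `pulls_per_arm` budget, the Bernoulli rewards, and the function name `run_streaming_bandit` are all hypothetical choices made for illustration. It only demonstrates the bookkeeping: at most $m$ arms in memory, $L_p$ rounds consumed per exploration pass, and the remaining rounds spent on exploitation.

```python
import random


def run_streaming_bandit(means, m, T, pulls_per_arm, passes, seed=0):
    """Toy P-pass streaming loop: explore during passes, then exploit.

    Hypothetical keep-the-empirically-best rule, NOT the paper's algorithm.
    """
    rng = random.Random(seed)
    memory = {}        # arm index -> (pulls, total reward); at most m entries
    rounds_used = 0
    pass_lengths = []  # L_p: rounds consumed by each exploration pass
    max_memory = 0

    def empirical_mean(arm):
        pulls, total = memory[arm]
        return total / max(pulls, 1)

    for _ in range(passes):
        start = rounds_used
        for arm in range(len(means)):      # arms arrive in stream order
            if arm not in memory:
                if len(memory) == m:       # memory full: drop the worst stored arm
                    del memory[min(memory, key=empirical_mean)]
                memory[arm] = (0, 0.0)     # read the arriving arm into memory
            max_memory = max(max_memory, len(memory))
            budget = min(pulls_per_arm, T - rounds_used)
            pulls, total = memory[arm]
            for _ in range(budget):        # Bernoulli reward for each pull
                total += 1.0 if rng.random() < means[arm] else 0.0
            memory[arm] = (pulls + budget, total)
            rounds_used += budget
        pass_lengths.append(rounds_used - start)

    # Exploitation phase: commit to the empirically best stored arm.
    best = max(memory, key=empirical_mean)
    exploit_rounds = T - rounds_used
    return best, pass_lengths, exploit_rounds, max_memory


if __name__ == "__main__":
    print(run_streaming_bandit([0.1, 0.9, 0.2, 0.3], m=2, T=1000,
                               pulls_per_arm=50, passes=2))
```

In this toy version each $L_p$ is deterministic because every arriving arm receives the same pull budget. In the figure's general formalization, $L_p$ may be random.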

Theorems & Definitions (52)

Theorem 1
Theorem 2
Proposition 3: LG21, Theorem 11
Lemma 4
Lemma 5
proof
Lemma 6
proof
Theorem 7: regret bound for the algorithm referenced as algo:large-m-simple
proof
...and 42 more
