Tight Gap-Dependent Memory-Regret Trade-Off for Single-Pass Streaming Stochastic Multi-Armed Bandits

Zichun Ye, Chihao Zhang, Jiahao Zhao

TL;DR

This work addresses gap-dependent regret in single-pass streaming stochastic MAB under a memory constraint. It introduces an α-parametrized framework that yields tight non-asymptotic regret bounds in two memory regimes: large memory (m ≥ 2n/3) and small memory (m < 2n/3), with regret scaling as ∑_{i:Δ_i>0} Δ_i^{1−2α} multiplied by memory- and time-dependent factors. The authors provide both upper bounds via two tailored algorithms and matching lower bounds by constructing hard instances that reduce to a Best Arm Retention problem, establishing the first tight gap-dependent results for streaming MAB. The analysis highlights a non-smooth dependence on memory size and demonstrates substantial T- and m-dependent improvements over prior results in memory-constrained streaming settings, clarifying the fundamental trade-offs between memory and exploration in streaming bandits.

Abstract

We study the problem of minimizing gap-dependent regret for single-pass streaming stochastic multi-armed bandits (MAB). In this problem, the $n$ arms arrive in a stream, and at most $m<n$ arms and their statistics can be stored in the memory. We establish tight non-asymptotic regret bounds in terms of all relevant parameters, including the number of arms $n$, the memory size $m$, the number of rounds $T$ and $(Δ_i)_{i\in [n]}$ where $Δ_i$ is the reward mean gap between the best arm and the $i$-th arm. These gaps are not known in advance by the player. Specifically, for any constant $α\ge 1$, we present two algorithms: one applicable for $m\ge \frac{2}{3}n$ with regret at most $O_α\Big(\frac{(n-m)T^{\frac{1}{α+ 1}}}{n^{1 + {\frac{1}{α+ 1}}}}\displaystyle\sum_{i:Δ_i > 0}Δ_i^{1 - 2α}\Big)$ and another applicable for $m<\frac{2}{3}n$ with regret at most $O_α\Big(\frac{T^{\frac{1}{α+1}}}{m^{\frac{1}{α+1}}}\displaystyle\sum_{i:Δ_i > 0}Δ_i^{1 - 2α}\Big)$. We also prove matching lower bounds for both cases by showing that for any constant $α\ge 1$ and any $m\leq k < n$, there exists a set of hard instances on which the regret of any algorithm is $Ω_α\Big(\frac{(k-m+1) T^{\frac{1}{α+1}}}{k^{1 + \frac{1}{α+1}}} \sum_{i:Δ_i > 0}Δ_i^{1-2α}\Big)$. This is the first tight gap-dependent regret bound for streaming MAB. Prior to our work, an $O\Big(\sum_{i\colonΔ_i>0} \frac{\sqrt{T}\log T}{Δ_i}\Big)$ upper bound for the special case of $α=1$ and $m=O(1)$ was established by Agarwal, Khanna and Patil (COLT'22). In contrast, our results provide the correct order of regret as $Θ\Big(\frac{1}{\sqrt{m}}\sum_{i\colonΔ_i>0}\frac{\sqrt{T}}{Δ_i}\Big)$.
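To get a feel for how the bounds in the abstract scale, the following sketch evaluates the expressions inside $O_α(\cdot)$ and $Ω_α(\cdot)$ with every hidden constant set to 1. The function names (`gap_sum`, `upper_bound_rate`, `lower_bound_rate`) are hypothetical and chosen for illustration only; the returned numbers are growth rates, not actual regret values.

```python
def gap_sum(gaps, alpha):
    """Sum of gap**(1 - 2*alpha) over arms with a strictly positive gap."""
    return sum(d ** (1 - 2 * alpha) for d in gaps if d > 0)


def upper_bound_rate(n, m, T, gaps, alpha=1.0):
    """Expression inside O_alpha(.) from the abstract, constants dropped.

    Assumption: hidden constants are set to 1, so only the scaling is meaningful.
    """
    if not 0 < m < n:
        raise ValueError("the memory size must satisfy 0 < m < n")
    e = 1.0 / (alpha + 1.0)
    s = gap_sum(gaps, alpha)
    if 3 * m >= 2 * n:  # large-memory regime, m >= 2n/3
        return (n - m) * T ** e / n ** (1 + e) * s
    return T ** e / m ** e * s  # small-memory regime, m < 2n/3


def lower_bound_rate(k, m, T, gaps, alpha=1.0):
    """Expression inside Omega_alpha(.) for a hard instance with parameter k, m <= k."""
    if not 0 < m <= k:
        raise ValueError("the parameter k must satisfy 0 < m <= k")
    e = 1.0 / (alpha + 1.0)
    return (k - m + 1) * T ** e / k ** (1 + e) * gap_sum(gaps, alpha)


if __name__ == "__main__":
    gaps = [0.0, 0.5, 0.25]  # the best arm has gap 0 and is excluded from the sum
    print(upper_bound_rate(n=100, m=4, T=10_000, gaps=gaps))  # small-memory regime
    print(upper_bound_rate(n=4, m=3, T=10_000, gaps=gaps))    # large-memory regime
```

With $α=1$ the small-memory expression reduces to $\sqrt{T/m}\sum_{i:Δ_i>0} 1/Δ_i$, which is the $Θ(\cdot)$ rate quoted at the end of the abstract.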

Paper Structure

This paper contains 23 sections, 20 theorems, 52 equations, 1 figure, 1 table, 3 algorithms.

Key Result

Lemma 2

Given $n$ arms with Bernoulli rewards stored in memory, if UCB is run on them for $T$ rounds, then for any $\mathtt{arm}_i$ with $\Delta_i > 0$: $\mathbf{E}\left[T_i\right] \leq \frac{8\log T}{\Delta_i^2}.$
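As a rough illustration of the setting in Lemma 2, the sketch below runs a UCB-style rule on Bernoulli arms and records how often each arm is pulled. It assumes the classic UCB1 index (empirical mean plus $\sqrt{2\ln t / N_i}$); the exact UCB variant used in the paper is not shown on this page, and the names `run_ucb1` and `pull_bound` are hypothetical. A single simulated run only gives one sample of the pull counts, so it illustrates the lemma rather than verifying a bound on the expectation.

```python
import math
import random


def run_ucb1(means, T, seed=0):
    """Run a UCB1-style rule on Bernoulli arms for T rounds; return per-arm pull counts.

    Assumption: the index is mean + sqrt(2 ln t / pulls), as in classic UCB1.
    Requires T >= len(means) so that every arm is pulled at least once.
    """
    rng = random.Random(seed)
    n = len(means)
    pulls = [0] * n
    totals = [0.0] * n
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1  # initialization: pull each arm once
        else:
            i = max(
                range(n),
                key=lambda j: totals[j] / pulls[j]
                + math.sqrt(2.0 * math.log(t) / pulls[j]),
            )
        reward = 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1
        totals[i] += reward
    return pulls


def pull_bound(T, gap):
    """Right-hand side of Lemma 2: 8 log T / gap^2."""
    return 8.0 * math.log(T) / gap ** 2


if __name__ == "__main__":
    means = [0.9, 0.5, 0.5]
    T = 5000
    pulls = run_ucb1(means, T, seed=1)
    best = max(means)
    for mu, count in zip(means, pulls):
        if mu < best:
            print(f"gap={best - mu:.2f} pulls={count} bound={pull_bound(T, best - mu):.1f}")
```

Averaging the pull counts over many seeds gives an empirical estimate of $\mathbf{E}[T_i]$ to set against the printed bound.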

Figures (1)

  • Figure 1: Regret with respect to the memory size $m$.

Theorems & Definitions (31)

  • Definition 1
  • Lemma 2: Aue02
  • Lemma 3
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • Theorem 6
  • Corollary 7
  • Lemma 8: implicitly in CHZ24
  • ...and 21 more