Tight Gap-Dependent Memory-Regret Trade-Off for Single-Pass Streaming Stochastic Multi-Armed Bandits
Zichun Ye, Chihao Zhang, Jiahao Zhao
TL;DR
This work studies gap-dependent regret in single-pass streaming stochastic MAB when at most m < n arms can be held in memory. For any constant α ≥ 1, it gives tight non-asymptotic regret bounds in two memory regimes, large memory (m ≥ 2n/3) and small memory (m < 2n/3), in which regret scales as ∑_{i:Δ_i>0} Δ_i^{1−2α} times a factor depending on n, m and T. The upper bounds come from two tailored algorithms, one per regime; the matching lower bounds come from hard instances that reduce to a Best Arm Retention problem. Together these are the first tight gap-dependent results for streaming MAB. The dependence on memory size is non-smooth, changing form at m = 2n/3, and for α = 1 the results sharpen the earlier O(∑_{i:Δ_i>0} √T log T / Δ_i) bound for m = O(1) to Θ((1/√m) ∑_{i:Δ_i>0} √T / Δ_i).
Abstract
We study the problem of minimizing gap-dependent regret for single-pass streaming stochastic multi-armed bandits (MAB). In this problem, the $n$ arms arrive in a stream, and at most $m<n$ arms and their statistics can be stored in memory. We establish tight non-asymptotic regret bounds in terms of all relevant parameters, including the number of arms $n$, the memory size $m$, the number of rounds $T$, and the gaps $(Δ_i)_{i\in [n]}$, where $Δ_i$ is the reward mean gap between the best arm and the $i$-th arm. These gaps are not known to the player in advance. Specifically, for any constant $α\ge 1$, we present two algorithms: one applicable for $m\ge \frac{2}{3}n$ with regret at most $O_α\Big(\frac{(n-m)T^{\frac{1}{α+1}}}{n^{1 + \frac{1}{α+1}}}\displaystyle\sum_{i:Δ_i > 0}Δ_i^{1 - 2α}\Big)$ and another applicable for $m<\frac{2}{3}n$ with regret at most $O_α\Big(\frac{T^{\frac{1}{α+1}}}{m^{\frac{1}{α+1}}}\displaystyle\sum_{i:Δ_i > 0}Δ_i^{1 - 2α}\Big)$. We also prove matching lower bounds for both cases by showing that for any constant $α\ge 1$ and any $m\leq k < n$, there exists a set of hard instances on which the regret of any algorithm is $Ω_α\Big(\frac{(k-m+1) T^{\frac{1}{α+1}}}{k^{1 + \frac{1}{α+1}}} \sum_{i:Δ_i > 0}Δ_i^{1-2α}\Big)$. This is the first tight gap-dependent regret bound for streaming MAB. Prior to our work, an $O\Big(\sum_{i\colon Δ_i>0} \frac{\sqrt{T}\log T}{Δ_i}\Big)$ upper bound for the special case of $α=1$ and $m=O(1)$ was established by Agarwal, Khanna and Patil (COLT'22). In contrast, our results provide the correct order of regret as $Θ\Big(\frac{1}{\sqrt{m}}\sum_{i\colon Δ_i>0}\frac{\sqrt{T}}{Δ_i}\Big)$.
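To get a feel for how the two regimes and the lower bound depend on $n$, $m$, $T$ and $α$, the expressions above can be evaluated numerically. The following is a minimal sketch, not the paper's algorithms: it only computes the quantities inside $O_α(\cdot)$ and $Ω_α(\cdot)$ with the hidden constants set to 1, which is an assumption made purely for illustration. The function names `gap_sum`, `upper_bound_rate` and `lower_bound_rate` are hypothetical.

```python
from typing import Sequence


def gap_sum(gaps: Sequence[float], alpha: float) -> float:
    """Sum of Delta_i^(1 - 2*alpha) over arms with a positive gap."""
    return sum(d ** (1.0 - 2.0 * alpha) for d in gaps if d > 0)


def upper_bound_rate(n: int, m: int, T: int, alpha: float,
                     gaps: Sequence[float]) -> float:
    """Expression inside O_alpha(.) from the abstract, constant assumed to be 1."""
    if not 0 < m < n:
        raise ValueError("the abstract assumes 0 < m < n")
    if alpha < 1:
        raise ValueError("the abstract assumes alpha >= 1")
    e = 1.0 / (alpha + 1.0)
    if 3 * m >= 2 * n:
        # Large-memory regime: m >= 2n/3.
        factor = (n - m) * T ** e / n ** (1.0 + e)
    else:
        # Small-memory regime: m < 2n/3.
        factor = (T / m) ** e
    return factor * gap_sum(gaps, alpha)


def lower_bound_rate(k: int, m: int, T: int, alpha: float,
                     gaps: Sequence[float]) -> float:
    """Expression inside Omega_alpha(.) for a given k with m <= k, constant assumed 1.

    The abstract additionally requires k < n; n is not needed to evaluate the
    expression, so it is not checked here.
    """
    if not 0 < m <= k:
        raise ValueError("the abstract assumes m <= k")
    e = 1.0 / (alpha + 1.0)
    return (k - m + 1) * T ** e / k ** (1.0 + e) * gap_sum(gaps, alpha)


if __name__ == "__main__":
    # Illustrative instance: one best arm and 99 arms with gap 0.5.
    gaps = [0.0] + [0.5] * 99
    print(upper_bound_rate(n=100, m=4, T=10_000, alpha=1.0, gaps=gaps))
    print(upper_bound_rate(n=100, m=90, T=10_000, alpha=1.0, gaps=gaps))
    print(lower_bound_rate(k=8, m=4, T=10_000, alpha=1.0, gaps=gaps))
```

Evaluating the lower-bound expression for different $k$ suggests how it lines up with the two upper bounds: taking $k$ proportional to $m$ makes $(k-m+1)/k$ a constant and leaves the small-memory rate $T^{\frac{1}{α+1}}/m^{\frac{1}{α+1}}$ up to constants, while taking $k=n-1$ gives a factor close to the large-memory rate $(n-m)T^{\frac{1}{α+1}}/n^{1+\frac{1}{α+1}}$.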
