Weighted Reservoir Sampling With Replacement from Data Streams
Adriano Meligrana, Adriano Fazzone
TL;DR
This paper addresses weighted reservoir sampling with replacement from data streams of unknown size, where inclusion probabilities are proportional to item weights. It introduces WRSWR-SKIP, a one-pass, skip-based algorithm that maintains a reservoir $\mathcal{R}$ of fixed size $m$ and uses a dynamic threshold $W_{\text{skip}} = W/q^{1/m}$ to skip items that do not update the reservoir. When an update occurs, it draws $k \sim B_{>0}\left(m, \frac{w_t}{W}\right)$ and writes the current item into $k$ random positions, which keeps the sample unbiased. The authors give a formal correctness proof and show that the Add operation requires $O\left(m \log \frac{W_N}{w_1}\right)$ random variates, while Get runs in $O(1)$ time. Experiments on synthetic data and the Wikipedia Clickstream show that WRSWR-SKIP outperforms the baselines. Because the reservoir is a valid weighted sample at every step, it can be used immediately, without post-processing.
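The skip-and-update rule summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `WeightedReservoir` and `truncated_binomial` are hypothetical. The sketch assumes that $q$ is a uniform random variate in $(0, 1]$, that all weights are strictly positive, and that $B_{>0}$ denotes a binomial conditioned on being nonzero. The truncated binomial draw here is a simple $O(m)$ sampler chosen for clarity, not for speed.

```python
import math
import random


def truncated_binomial(m, p, rng):
    """Draw k ~ Binomial(m, p) conditioned on k > 0 (assumes 0 < p <= 1)."""
    if p >= 1.0:
        return m
    # 1 - (1 - p)^m, computed stably for small p.
    p_any = -math.expm1(m * math.log1p(-p))
    # Index of the first success, conditioned on at least one success
    # (inverse-CDF sampling of a truncated geometric distribution).
    u = rng.random()
    first = int(math.log1p(-u * p_any) / math.log1p(-p))
    first = min(first, m - 1)  # guard against floating-point rounding
    # Trials after the first success are ordinary Bernoulli(p) trials.
    rest = sum(rng.random() < p for _ in range(m - first - 1))
    return 1 + rest


class WeightedReservoir:
    """Sketch of skip-based weighted reservoir sampling with replacement."""

    def __init__(self, m, seed=None):
        self.m = m
        self.rng = random.Random(seed)
        self.reservoir = [None] * m
        self.total = 0.0  # W: total weight seen so far
        self.skip = 0.0   # W_skip: no update until W exceeds this value

    def add(self, item, weight):
        self.total += weight
        if self.total <= self.skip:
            return  # skipped: this item changes no slot
        # At least one slot changes; draw how many, then which ones.
        k = truncated_binomial(self.m, weight / self.total, self.rng)
        for pos in self.rng.sample(range(self.m), k):
            self.reservoir[pos] = item
        # Draw the next threshold W_skip = W / q^(1/m), with q in (0, 1].
        q = 1.0 - self.rng.random()
        self.skip = self.total / q ** (1.0 / self.m)

    def get(self):
        return list(self.reservoir)
```

The first item always fills every slot, because its weight equals the total weight and the update probability is 1. A caller would feed `(item, weight)` pairs to `add` as they arrive and read the current sample with `get` at any time.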
Abstract
In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement, although most of the literature on this topic has focused on sampling without replacement. Our algorithm generates a weighted random sample in one pass over a population of unknown size. At any point in time, the sample is representative of the population seen so far and can be directly used by other modules without requiring any post-processing. We formally prove the correctness and efficiency of our method. An experimental analysis shows the performance of our method in practice when compared to state-of-the-art methods.
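The statement that the sample is representative at any point in time can be written as a per-slot invariant. The derivation below is a sketch under assumed notation that does not appear in the abstract: $x_i$ is the $i$-th item, $w_i > 0$ its weight, $W_t = \sum_{i \le t} w_i$, and $\mathcal{R}_t[j]$ the content of slot $j$ after $t$ items. It also assumes that each slot is overwritten by item $x_t$ independently with probability $w_t / W_t$. If slot $j$ holds $x_i$ with probability $w_i / W_{t-1}$ before item $t$ arrives, then for $i < t$:

$$
\Pr\left[\mathcal{R}_t[j] = x_i\right]
= \frac{w_i}{W_{t-1}} \left(1 - \frac{w_t}{W_t}\right)
= \frac{w_i}{W_{t-1}} \cdot \frac{W_{t-1}}{W_t}
= \frac{w_i}{W_t},
\qquad
\Pr\left[\mathcal{R}_t[j] = x_t\right] = \frac{w_t}{W_t}.
$$

By induction on $t$, every slot holds item $x_i$ with probability proportional to $w_i$ after each step, which is why no post-processing is needed before the sample is used.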
