Weighted Reservoir Sampling With Replacement from Data Streams

Adriano Meligrana, Adriano Fazzone

TL;DR

This paper addresses weighted reservoir sampling with replacement from data streams of unknown size, where inclusion probabilities are proportional to item weights. It introduces WRSWR-SKIP, a one-pass, skip-based algorithm that maintains a fixed-size reservoir $\mathcal{R}$ of size $m$ and uses a dynamic threshold $W_{\text{skip}} = W/q^{1/m}$ to skip items that do not change the reservoir. When an update does occur, it draws $k \sim B_{>0}\left(m, \frac{w_t}{W}\right)$ (a binomial restricted to positive values) and writes the current item into $k$ random positions, which keeps the sample unbiased. The authors provide a formal correctness proof and show that the Add operation uses $O\left(m \log \frac{W_N}{w_1}\right)$ random variates, while Get runs in $O(1)$ time. Experiments on synthetic data and the Wikipedia Clickstream dataset show WRSWR-SKIP outperforming the baselines. Because the reservoir is a valid weighted sample at every point in the stream, it can be used immediately, without post-processing.
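The update rule summarized above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class name `WeightedReservoirWR`, the helper `zero_truncated_binomial`, and the choice to have `get` return the whole reservoir are assumptions made here. The sketch also assumes that $W$ is the running total weight and that $q$ is a fresh uniform variate in $(0, 1]$, and it draws $k$ by plain CDF inversion, which may differ from the sampler used in the paper.

```python
import math
import random


def zero_truncated_binomial(m, p, rng):
    """Draw k from Binomial(m, p) conditioned on k >= 1, by CDF inversion."""
    if p >= 1.0:
        return m
    norm = -math.expm1(m * math.log1p(-p))  # 1 - (1 - p)^m, computed stably
    u = rng.random() * norm                 # uniform over the mass of k >= 1
    pmf = m * p * (1.0 - p) ** (m - 1)      # P(K = 1), not divided by norm
    cdf, k = pmf, 1
    while cdf < u and k < m:
        # Ratio P(K = k + 1) / P(K = k) for a binomial distribution.
        pmf *= (m - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return k


class WeightedReservoirWR:
    """Hypothetical sketch of a skip-based weighted reservoir with replacement."""

    def __init__(self, m, seed=None):
        self.m = m
        self.rng = random.Random(seed)
        self.reservoir = [None] * m
        self.total = 0.0   # W: total weight seen so far
        self.w_skip = 0.0  # W_skip: 0 forces the first item into every slot

    def add(self, item, weight):
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.total += weight
        if self.total <= self.w_skip:
            return  # item skipped: the reservoir is unchanged
        # An update happens: overwrite k >= 1 distinct random slots.
        k = zero_truncated_binomial(self.m, weight / self.total, self.rng)
        for pos in self.rng.sample(range(self.m), k):
            self.reservoir[pos] = item
        # Assumed threshold rule: W_skip = W / q^(1/m), q uniform in (0, 1].
        q = 1.0 - self.rng.random()
        self.w_skip = self.total / q ** (1.0 / self.m)

    def get(self):
        # Returns the reservoir itself, so no per-call work is needed.
        return self.reservoir
```

For example, after feeding a stream in which item `"b"` carries about three quarters of the total weight, roughly three quarters of the slots returned by `get()` should hold `"b"`, up to sampling noise.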

Abstract

In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement, although most of the literature on this topic has focused on sampling without replacement. Our algorithm generates a weighted random sample in one pass over a population of unknown size. At any point in time, the sample is representative of the population seen so far and can be directly used by other modules without requiring any post-processing. We formally prove the correctness and efficiency of our method. An experimental analysis shows the performance of our method in practice when compared to state-of-the-art methods.

Paper Structure (5 sections, 2 theorems, 2 figures, 1 table, 1 algorithm)

Key Result

Lemma 1

Algorithm 1 (WRSWR-SKIP) keeps an unbiased weighted random sample with replacement at each iteration.
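
The following derivation suggests why the lemma holds. It is a sketch under stated assumptions, not the authors' proof. Assume $W_t = \sum_{i \le t} w_i$ is the running total weight and that, in a version of the algorithm without skipping, each of the $m$ slots is independently overwritten by item $t$ with probability $w_t / W_t$. If a slot holds item $i < t$ with probability $w_i / W_{t-1}$ before step $t$, then after step $t$:

$$
\Pr[\text{slot holds } i] = \frac{w_i}{W_{t-1}}\left(1 - \frac{w_t}{W_t}\right) = \frac{w_i}{W_{t-1}} \cdot \frac{W_{t-1}}{W_t} = \frac{w_i}{W_t},
\qquad
\Pr[\text{slot holds } t] = \frac{w_t}{W_t}.
$$

Under the same assumption, the probability that no slot changes between steps $s$ and $t$ telescopes:

$$
\prod_{j=s+1}^{t} \left(\frac{W_{j-1}}{W_j}\right)^{m} = \left(\frac{W_s}{W_t}\right)^{m} = \Pr\left[W_t \le \frac{W_s}{q^{1/m}}\right],
$$

where $q$ is taken to be uniform on $(0, 1)$. This matches the form of the threshold $W_{\text{skip}} = W / q^{1/m}$ quoted in the TL;DR.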

Figures (2)

  • Figure 1: Average time (ns) over 100 executions for Add (top) and Get (bottom) operations vs. reservoir size ($m$). Columns correspond to streams with decreasing, constant, and increasing weights, respectively; each stream contains 10M items.
  • Figure 2: Average time (ns) over 100 executions for Add (top) and Get (bottom) operations vs. reservoir size ($m$). Stream of 34M items from Wikipedia Clickstream dataset.

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2