Perfect Sampling in Turnstile Streams Beyond Small Moments
David P. Woodruff, Shenghao Xie, Samson Zhou
TL;DR
This work extends perfect sampling in turnstile streams beyond small moments by giving the first perfect $L_p$ sampler for $p>2$, which uses $\tilde{O}(n^{1-2/p})$ bits of space, together with a $(1+\varepsilon)$-approximation of the sampled value $x_i$ at an additional cost of $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits. The authors develop exact samplers for both integer and fractional $p$, general polynomial samplers, and an efficient approximate $L_p$ sampler with fast update time built from a two-stage CountSketch construction and derandomization. The framework also covers specific non-polynomial $G$-functions, namely the logarithmic function $G(z)=\log(1+|z|)$ and the cap function $G(z)=\min(T,|z|^p)$, and is applied to norm/moment estimation on subsets of coordinates that are revealed only after the stream has been processed. A matching lower bound on the sketching dimension shows that the space usage is optimal up to polylogarithmic factors, and a general rejection-sampling framework connects these samplers to practical tasks such as forgetful data processing. Overall, the results broaden the set of sublinear-space samplers available for turnstile streams, with implications for privacy-aware analytics and efficient subset-norm computations in large-scale data systems.
Abstract
Given a vector $x \in \mathbb{R}^n$ induced by a turnstile stream $S$ and a non-negative function $G: \mathbb{R} \to \mathbb{R}$, a perfect $G$-sampler outputs an index $i$ with probability $\frac{G(x_i)}{\sum_{j\in[n]} G(x_j)}+\frac{1}{\text{poly}(n)}$. Jayaram and Woodruff (FOCS 2018) introduced a perfect $L_p$-sampler, where $G(z)=|z|^p$, for $p\in(0,2]$. In this paper, we solve this problem for $p>2$ by a sampling-and-rejection method. Our algorithm uses $n^{1-2/p} \cdot \text{polylog}(n)$ bits of space, which is tight up to polylogarithmic factors in $n$. Our algorithm also provides a $(1+\varepsilon)$-approximation to the sampled item $x_i$ with high probability using an additional $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits of space. Interestingly, we show that our techniques generalize to perfect polynomial samplers on turnstile streams, a class of functions that is not scale-invariant, in contrast to the existing perfect $L_p$ samplers. We also achieve perfect samplers for the logarithmic function $G(z)=\log(1+|z|)$ and the cap function $G(z)=\min(T,|z|^p)$. Finally, we give an application of our results to the problem of norm/moment estimation for a subset $\mathcal{Q}$ of coordinates of a vector, revealed only after the data stream is processed, e.g., when the set $\mathcal{Q}$ represents a range query, or the set $[n]\setminus\mathcal{Q}$ represents a collection of entities who wish for their information to be expunged from the dataset.
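To make the sampling guarantee concrete, the following is a minimal offline sketch in Python. It is not the streaming algorithm of the paper: it stores $x$ explicitly, uses linear space, and only illustrates the target distribution $G(x_i)/\sum_{j\in[n]} G(x_j)$ and the generic idea of rejection sampling. The function names `g_sampling_distribution` and `lp_sample_by_rejection` are hypothetical, and the specific choices made here (proposing from the exact $L_2$ distribution and accepting with probability $(|x_i|/\|x\|_\infty)^{p-2}$) are assumptions chosen for illustration, not details taken from the abstract.

```python
import random


def g_sampling_distribution(x, G):
    """Exact target distribution G(x_i) / sum_j G(x_j), computed offline.

    This ignores the additive 1/poly(n) error term allowed by the definition.
    """
    weights = [G(v) for v in x]
    total = sum(weights)
    if total == 0:
        raise ValueError("G(x) is identically zero; the distribution is undefined")
    return [w / total for w in weights]


def lp_sample_by_rejection(x, p, rng):
    """Draw an index i with probability |x_i|^p / sum_j |x_j|^p, for p > 2.

    Illustrative assumption: propose i from the exact L_2 distribution
    (probability x_i^2 / ||x||_2^2), then accept with probability
    (|x_i| / ||x||_inf)^(p - 2). The product of the two is proportional
    to |x_i|^p, so accepted indices follow the L_p distribution.
    """
    if p <= 2:
        raise ValueError("this sketch is only meant for p > 2")
    max_abs = max(abs(v) for v in x)
    if max_abs == 0:
        raise ValueError("x is the zero vector; the distribution is undefined")
    l2_weights = [v * v for v in x]
    indices = range(len(x))
    while True:
        # Proposal step: an exact (offline) L_2 sample.
        i = rng.choices(indices, weights=l2_weights, k=1)[0]
        # Rejection step: keep i with probability (|x_i| / max_abs)^(p - 2).
        if rng.random() < (abs(x[i]) / max_abs) ** (p - 2):
            return i


if __name__ == "__main__":
    x = [3.0, -1.0, 0.0, 2.0]  # toy frequency vector (hypothetical data)
    p = 3
    rng = random.Random(0)
    trials = 20000
    counts = [0] * len(x)
    for _ in range(trials):
        counts[lp_sample_by_rejection(x, p, rng)] += 1
    print("target   :", g_sampling_distribution(x, lambda z: abs(z) ** p))
    print("empirical:", [c / trials for c in counts])
```

In this offline setting the acceptance test uses the exact value of $\|x\|_\infty$, which a small-space streaming algorithm would not have available. The sketch is therefore only a correctness illustration of why a proposal followed by a rejection step can yield $L_p$ samples for $p>2$.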
