Perfect Sampling in Turnstile Streams Beyond Small Moments

David P. Woodruff, Shenghao Xie, Samson Zhou

TL;DR

This work extends perfect sampling in turnstile streams beyond small moments by giving the first perfect $L_p$ sampler for $p>2$, using $\tilde{O}(n^{1-2/p})$ bits of space. It also returns a $(1+\varepsilon)$-approximation to the sampled value using an additional $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits. The authors develop exact samplers (for integer and fractional $p$) and general polynomial samplers, as well as an efficient approximate $L_p$ sampler with fast update time via a two-stage CountSketch construction and derandomization. The framework further covers specific non-polynomial $G$-functions, namely the logarithm $G(z)=\log(1+|z|)$ and the cap $G(z)=\min(T,|z|^p)$, and is applied to norm/moment estimation on subsets of coordinates revealed only after the stream ends. A matching lower bound on sketching dimension shows the space usage is optimal up to polylogarithmic factors, and a general rejection-sampling framework connects these samplers to tasks such as forgetful data processing. Overall, the results broaden the set of sublinear-space samplers available in turnstile streams, with implications for privacy-aware analytics and subset-norm computation in large-scale data systems.

Abstract

Given a vector $x \in \mathbb{R}^n$ induced by a turnstile stream $S$ and a non-negative function $G: \mathbb{R} \to \mathbb{R}$, a perfect $G$-sampler outputs an index $i$ with probability $\frac{G(x_i)}{\sum_{j\in[n]} G(x_j)}+\frac{1}{\text{poly}(n)}$. Jayaram and Woodruff (FOCS 2018) introduced a perfect $L_p$-sampler, where $G(z)=|z|^p$, for $p\in(0,2]$. In this paper, we solve this problem for $p>2$ by a sampling-and-rejection method. Our algorithm uses $n^{1-2/p} \cdot \text{polylog}(n)$ bits of space, which is tight up to polylogarithmic factors in $n$. Our algorithm also provides a $(1+\varepsilon)$-approximation to the sampled item $x_i$ with high probability using an additional $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits of space. Interestingly, we show our techniques can be generalized to perfect polynomial samplers on turnstile streams, which is a class of functions that is not scale-invariant, in contrast to the existing perfect $L_p$ samplers. We also achieve perfect samplers for the logarithmic function $G(z)=\log(1+|z|)$ and the cap function $G(z)=\min(T,|z|^p)$. Finally, we give an application of our results to the problem of norm/moment estimation for a subset $\mathcal{Q}$ of coordinates of a vector, revealed only after the data stream is processed, e.g., when the set $\mathcal{Q}$ represents a range query, or the set $[n]\setminus\mathcal{Q}$ represents a collection of entities who wish for their information to be expunged from the dataset.
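To make the $G$-sampler definition and the sampling-and-rejection idea concrete, the following is a minimal offline sketch in Python. It is not the paper's streaming algorithm: it assumes the whole vector $x$ is held in memory, and the function names (`exact_g_distribution`, `lp_sample_by_rejection`) and the `max_trials` parameter are hypothetical, introduced only for illustration. It draws an index from the $L_2$ distribution and accepts it with probability $(|x_i|/\|x\|_\infty)^{p-2}$, so that accepted indices follow the $L_p$ distribution.

```python
import random


def exact_g_distribution(x, G):
    """Target distribution of a G-sampler: Pr[i] = G(x_i) / sum_j G(x_j)."""
    weights = [G(v) for v in x]
    total = sum(weights)
    return [w / total for w in weights]


def lp_sample_by_rejection(x, p, rng, max_trials=100_000):
    """Offline illustration only (assumes x is fully known and nonzero).

    Draw i from the L_2 distribution, then accept it with probability
    (|x_i| / max_j |x_j|)^(p-2). Accepted indices are distributed
    proportionally to |x_i|^p. Returns None if every trial is rejected.
    """
    assert p > 2
    l2_weights = [v * v for v in x]
    x_max = max(abs(v) for v in x)
    indices = range(len(x))
    for _ in range(max_trials):
        i = rng.choices(indices, weights=l2_weights, k=1)[0]
        if rng.random() < (abs(x[i]) / x_max) ** (p - 2):
            return i
    return None
```

In this offline setting, each trial accepts with probability $\|x\|_p^p / (\|x\|_2^2 \, \|x\|_\infty^{p-2})$, which is at least $n^{-(1-2/p)}$ because $\|x\|_2^2 \le n^{1-2/p}\|x\|_p^2$ and $\|x\|_\infty \le \|x\|_p$. Roughly $n^{1-2/p}$ trials therefore suffice in expectation, which is in line with the space bound quoted above, although the paper's actual construction and analysis may differ.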

Paper Structure

This paper contains 36 sections, 57 theorems, 119 equations, 1 table, 8 algorithms.

Key Result

Theorem 1.2

For any $p>2$ and failure probability $\delta\in(0,1)$, there exists a perfect $L_p$ sampler on a turnstile stream that uses $\tilde{\mathcal{O}}\left(n^{1-2/p}\log\frac{1}{\delta}\right)$ bits of space and succeeds with probability at least $1-\delta$. Moreover, it obtains a $(1+\varepsilon)$-estim

Theorems & Definitions (95)

  • Definition 1.1: $G$-sampler
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Theorem 1.5
  • Theorem 1.6
  • Theorem 1.7: Khintchine inequality
  • Proposition 1.8
  • Definition 1.9: Perfect $L_p$ sampler
  • Theorem 1.10: Perfect $L_p$ sampler for $p \le 2$, cf. Theorem 9 in Jayaram and Woodruff (2018)
  • ...and 85 more