Perfect $L_p$ Sampling with Polylogarithmic Update Time

William Swartworth, David P. Woodruff, Samson Zhou

TL;DR

The paper resolves the open problem of building a perfect $L_p$ sampler for $p\in(0,2)$ on turnstile streams that has both optimal memory and fast updates. The sampler runs in polylogarithmic time per update and near-optimal space: it transforms the input with exponential scaling, uses dense Gaussian sketches to find heavy hitters, and applies a Gil-Pelaez-based inversion to sample from a complicated, heavy-tailed distribution. A key novelty is a simulation oracle based on Poisson approximation that avoids explicit duplication, together with a rigorous derandomization via the GKM PRG that yields a derandomized sampler with polylogarithmic update time. The result advances exact streaming sampling techniques and enables efficient, plug-in perfect $L_p$ sampling for applications such as norm estimation and anomaly detection in large-scale data streams.
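The exponential-scaling idea can be illustrated offline. The sketch below is not the paper's streaming algorithm: it stores the whole vector and uses fresh randomness, and the names `exp_scaling_sample`, `target_distribution`, and `empirical_distribution` are hypothetical. It relies only on a standard fact: if $e_1,\ldots,e_n$ are independent rate-1 exponentials, the index maximizing $|x_i|/e_i^{1/p}$ equals $i$ with probability $|x_i|^p/\|x\|_p^p$.

```python
import random


def exp_scaling_sample(x, p, rng):
    """Return argmax_i |x_i| / e_i^(1/p) for i.i.d. Exp(1) variables e_i.

    Offline illustration only: assumes x is nonzero and fits in memory.
    """
    best_index, best_value = None, -1.0
    for i, xi in enumerate(x):
        e = rng.expovariate(1.0)
        while e == 0.0:  # guard against the (astronomically rare) exact zero
            e = rng.expovariate(1.0)
        scaled = abs(xi) / e ** (1.0 / p)
        if scaled > best_value:
            best_index, best_value = i, scaled
    return best_index


def target_distribution(x, p):
    """The exact L_p distribution |x_i|^p / ||x||_p^p."""
    weights = [abs(xi) ** p for xi in x]
    total = sum(weights)
    return [w / total for w in weights]


def empirical_distribution(x, p, trials, rng):
    """Frequency of each index over repeated independent samples."""
    counts = [0] * len(x)
    for _ in range(trials):
        counts[exp_scaling_sample(x, p, rng)] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    x, p = [3, -1, 2, 0, 5], 1.5
    print("target   :", target_distribution(x, p))
    print("empirical:", empirical_distribution(x, p, 20000, rng))
```

The hard part addressed by the paper is doing this in small space and polylogarithmic update time on a turnstile stream, which this offline sketch does not attempt.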

Abstract

Perfect $L_p$ sampling in a stream was introduced by Jayaram and Woodruff (FOCS 2018) as a streaming primitive which, given turnstile updates to a vector $x \in \{-\text{poly}(n), \ldots, \text{poly}(n)\}^n$, outputs an index $i^* \in \{1, 2, \ldots, n\}$ such that the probability of returning index $i$ is exactly \[\Pr[i^* = i] = \frac{|x_i|^p}{\|x\|_p^p} \pm \frac{1}{n^C},\] where $C > 0$ is an arbitrarily large constant. Jayaram and Woodruff achieved the optimal $\tilde{O}(\log^2 n)$ bits of memory for $0 < p < 2$, but their update time is at least $n^C$ per stream update. Thus an important open question is to achieve efficient update time while maintaining optimal space. For $0 < p < 2$, we give the first perfect $L_p$-sampler with the same optimal amount of memory but with only $\text{poly}(\log n)$ update time. Crucial to our result is an efficient simulation of a sum of reciprocals of powers of truncated exponential random variables by approximating its characteristic function, using the Gil-Pelaez inversion formula, and applying variants of the trapezoid formula to quickly approximate it.
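To make the inversion step concrete, here is a minimal sketch of the Gil-Pelaez formula, $F(x) = \tfrac12 - \tfrac1\pi \int_0^\infty \operatorname{Im}\!\left[e^{-itx}\varphi(t)\right]/t \, dt$, evaluated with the plain trapezoid rule. It is an illustration under stated assumptions rather than the paper's procedure: the characteristic function used is that of a standard normal (chosen only because its CDF is known), and the function names, the truncation point `t_max`, and the step count are hypothetical choices.

```python
import cmath
import math


def gil_pelaez_cdf(cf, x, t_max=12.0, steps=2000):
    """Approximate F(x) from a characteristic function cf via Gil-Pelaez.

    The integral over [0, infinity) is truncated at t_max (an assumption:
    cf must be negligible beyond it) and computed with the trapezoid rule.
    """
    def integrand(t):
        return (cmath.exp(-1j * t * x) * cf(t)).imag / t

    h = t_max / steps
    # The integrand has a finite limit at t = 0; approximate it by a tiny t.
    total = 0.5 * (integrand(1e-9) + integrand(t_max))
    for k in range(1, steps):
        total += integrand(k * h)
    return 0.5 - (h * total) / math.pi


def standard_normal_cf(t):
    """Characteristic function of N(0, 1), used only as a test case."""
    return cmath.exp(-0.5 * t * t)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0):
        exact = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        print(x, gil_pelaez_cdf(standard_normal_cf, x), exact)
```

For the heavy-tailed sums treated in the paper the characteristic function decays far more slowly than in this test case, which is why a naive truncation like the one here would not suffice and more careful variants of the trapezoid formula are needed.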

Paper Structure

This paper contains 26 sections, 28 theorems, 123 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1.2

For any $p\in(0,2)$ and failure probability $\delta\in(0,1)$, there exists a perfect $L_p$ sampler on a turnstile stream that succeeds with probability at least $1-\delta$ and uses $\mathop{\mathrm{polylog}}\limits(n)$ time per update. The algorithm uses $\tilde{\mathcal{O}}\left(\log^2 n\log\frac{1

Figures (1)

  • Figure 1: Statistical test for perfect $L_p$ sampler in \ref{alg:alg:perfect:lp:sample}.

Theorems & Definitions (53)

  • Definition 1.1: $L_p$-sampler
  • Theorem 1.2
  • Theorem 3.1: Marcinkiewicz–Zygmund inequality
  • Lemma 3.2
  • proof
  • Theorem 3.3
  • Theorem 3.4
  • Definition 3.5: Exponential random variable
  • Lemma 3.6
  • Lemma 3.7: Lemma 2 in JayaramW18
  • ...and 43 more