Unbiased Insights: Optimal Streaming Algorithms for $\ell_p$ Sampling, the Forget Model, and Beyond

Honghao Lin, Hoai-An Nguyen, William Swartworth, David P. Woodruff

TL;DR

This paper advances ℓ_p sampling and frequency-moment estimation in single-pass insertion-only data streams, and extends both to streams with forget operations and non-linear updates. It gives a nearly space-optimal approximate ℓ_p sampler for p∈(0,2) using Õ(log n log(1/δ)) bits, and a sampler for p=2 using Õ(log² n log(1/δ)) bits, together with a continuous variant that outputs a valid sample index at every point during the stream. Leveraging these samplers, the authors obtain nearly unbiased estimators for F_p in the α-RFDS Forget Model, resolving the three open problems posed by Pavan, Chakraborty, Vinodchandran, and Meel, and extend the framework to suffix-prefix deletion and a broad class of contracting updates, with entropy estimation as a corollary. The approach blends heavy-hitter structures, adaptive sparse recovery, derandomization, and Taylor expansions to achieve tight space-accuracy trade-offs across p-regimes, supported by matching lower bounds. Overall, the work extends sublinear-space streaming sketches to non-linear updates and deletion models with provable guarantees.

Abstract

We study $\ell_p$ sampling and frequency moment estimation in a single-pass insertion-only data stream. For $p \in (0,2)$, we present a nearly space-optimal approximate $\ell_p$ sampler that uses $\widetilde{O}(\log n \log(1/\delta))$ bits of space, and for $p = 2$, we present a sampler with space complexity $\widetilde{O}(\log^2 n \log(1/\delta))$. This space complexity is optimal for $p \in (0, 2)$ and improves upon prior work by a $\log n$ factor. We further extend our construction to a continuous $\ell_p$ sampler, which outputs a valid sample index at every point during the stream. Leveraging these samplers, we design nearly unbiased estimators for $F_p$ in data streams that include forget operations, which reset individual element frequencies and introduce significant non-linear challenges. As a result, we obtain near-optimal algorithms for estimating $F_p$ for all $p$ in this model, originally proposed by Pavan, Chakraborty, Vinodchandran, and Meel [PODS'24], resolving all three open problems they posed. Furthermore, we generalize this model to what we call the suffix-prefix deletion model, and extend our techniques to estimate entropy as a corollary of our moment estimation algorithms. Finally, we show how to handle arbitrary coordinate-wise functions during the stream, for any $g \in \mathbb{G}$, where $\mathbb{G}$ includes all (linear or non-linear) contraction functions.
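
To make the forget model concrete, the following is a brute-force, linear-space reference for the quantity being estimated. It is not the paper's algorithm, only a sketch of the semantics the abstract describes: an insertion increments an element's frequency, a forget operation resets it, and $F_p = \sum_i f_i^p$. The function name `exact_fp_with_forgets`, the `("ins", item)` / `("forget", item)` encoding, and the choice that a reset means "back to zero" are assumptions made for illustration.

```python
from collections import defaultdict


def exact_fp_with_forgets(stream, p):
    """Exact F_p = sum_i f_i**p after processing insertions and forgets.

    Stores every distinct element, so this is only a correctness
    baseline, not a sublinear-space streaming sketch.
    """
    freq = defaultdict(int)
    for op, item in stream:
        if op == "ins":
            freq[item] += 1
        elif op == "forget":
            # Assumption: a forget resets the element's frequency to zero.
            freq.pop(item, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return sum(f ** p for f in freq.values())


stream = [("ins", "a"), ("ins", "a"), ("ins", "b"),
          ("forget", "a"), ("ins", "a"), ("ins", "b")]
print(exact_fp_with_forgets(stream, 2))  # a -> 1, b -> 2, so F_2 = 1 + 4 = 5
```

Without the forget, the same stream would give frequencies 3 and 2 and $F_2 = 13$; the reset changes the answer in a way that depends on the current frequency of the element, which is why the update is non-linear.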

Paper Structure

This paper contains 55 sections, 52 theorems, 97 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 1.1

For any constant $p \in (0, 2]$, there is a one-pass streaming algorithm that runs in space $\widetilde{O}(\log^{c(p)} n \log(1/\delta))$ and outputs an index $i$ such that, with probability at least $1 - \delta$, for every $j \in [n]$ the probability that $i = j$ approximates the $\ell_p$ sampling probability $|x_j|^p / \|x\|_p^p$, where $x$ is the frequency vector of the stream. Here $c(p) = 1$ when $0 < p < 2$ and $c(p) = 2$ when $p = 2$.
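
As a point of reference for the distribution an $\ell_p$ sampler targets, the sketch below draws an index with probability $|x_j|^p / \|x\|_p^p$ from a fully stored frequency vector. It relies on a standard identity for exponential random variables: if $e_1, \dots, e_n$ are independent rate-1 exponentials, then $\arg\min_j e_j / |x_j|^p$ equals $j$ with probability $|x_j|^p / \|x\|_p^p$. This is a linear-space illustration under that assumption, not the theorem's small-space algorithm, and `lp_sample` is a hypothetical name.

```python
import random


def lp_sample(x, p, rng=random):
    """Return index j with probability |x[j]|**p / sum_i |x[i]|**p."""
    best_index, best_key = None, float("inf")
    for j, value in enumerate(x):
        if value == 0:
            continue  # zero coordinates are never sampled
        # e / |x_j|^p is exponential with rate |x_j|^p; the minimum of
        # independent exponentials falls on j with probability
        # rate_j / (sum of all rates).
        key = rng.expovariate(1.0) / abs(value) ** p
        if key < best_key:
            best_index, best_key = j, key
    if best_index is None:
        raise ValueError("cannot sample from the all-zero vector")
    return best_index


rng = random.Random(0)
x = [3, 0, 1]
draws = [lp_sample(x, 2, rng) for _ in range(20000)]
print(draws.count(0) / len(draws))  # close to 9/10 for p = 2
```

For $x = (3, 0, 1)$ and $p = 2$, the target probabilities are $9/10$, $0$, and $1/10$, so the empirical frequency of index 0 should concentrate near $0.9$.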

Figures (1)

  • Figure 1: $F_p$ Estimation with Forget Operations

Theorems & Definitions (83)

  • Theorem 1.1: $\ell_p$ sampler
  • Theorem 1.2: Continuous $\ell_p$ sampler
  • Theorem 1.3: Main result for $F_p$ estimation with forgets
  • Theorem 1.4: Lower bounds for $F_p$ estimation with forgets
  • Definition 3.1: Exponential random variables
  • Definition 3.2: Anti-rank
  • Corollary 3.3
  • Lemma 3.3: JW2021perfect
  • Lemma 3.4: Proposition 1, JW2021perfect
  • Theorem 3.1
  • ...and 73 more