Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures

Alina Ene, Alessandro Epasto, Vahab Mirrokni, Hoai-An Nguyen, Huy L. Nguyen, David P. Woodruff, Peilin Zhong

TL;DR

This work delivers the first one-pass turnstile streaming algorithms for maximum coverage and extends the approach to targeted and general fingerprinting using linear sketches. It introduces a novel $n^p - F_p$ complement estimator for $p\ge 2$ and provides a practical submodular maximization framework that preserves approximation guarantees under sketching. The results show sublinear-space, near-optimal $(1-1/e-\varepsilon)$-approximation performance with polylog update time and significant empirical speedups (up to $210\times$) over prior methods, including a dimensionality-reduction application for machine learning pipelines. Together, these methods enable real-time risk assessment and scalable privacy-preserving data analysis in dynamic data streams.

Abstract

In the maximum coverage problem, we are given $d$ subsets of a universe $[n]$, and the goal is to output $k$ subsets whose union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates that insert or delete an item from a subset arrive one by one. Notably, our algorithm uses only $\mathrm{poly}\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management, where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to $210\times$ over prior work.
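To make the problem statement concrete, here is a minimal offline sketch in Python. It replays a stream of insert/delete updates exactly (no sketching, so it uses linear space) and then runs the classical greedy heuristic for maximum coverage, which is known to achieve a $(1-1/e)$ approximation. This is a baseline for illustration only, not the paper's streaming algorithm; the function names and the `(item, subset, delta)` update format are assumptions made for this example.

```python
from collections import defaultdict


def apply_turnstile_updates(updates):
    """Replay (item, subset, delta) updates exactly.

    delta is +1 for an insertion and -1 for a deletion (assumed format).
    Returns a dict mapping each subset name to the set of items it
    contains once all updates have been applied.
    """
    counts = defaultdict(int)
    for item, subset, delta in updates:
        counts[(item, subset)] += delta

    subsets = defaultdict(set)
    for (item, subset), count in counts.items():
        if count > 0:
            subsets[subset].add(item)
    return dict(subsets)


def greedy_max_coverage(subsets, k):
    """Classical greedy: repeatedly pick the subset with the largest
    number of not-yet-covered items. Ties go to the smallest name."""
    covered = set()
    chosen = []
    for _ in range(k):
        candidates = [s for s in sorted(subsets) if s not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda s: len(subsets[s] - covered))
        if not subsets[best] - covered:
            break  # no remaining subset adds a new item
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered


# Hypothetical stream: item 3 is inserted into A and later deleted.
updates = [
    (1, "A", +1), (2, "A", +1), (3, "A", +1),
    (3, "B", +1), (4, "B", +1),
    (4, "C", +1), (5, "C", +1), (6, "C", +1),
    (3, "A", -1),
]
subsets = apply_turnstile_updates(updates)
chosen, covered = greedy_max_coverage(subsets, k=2)
print(chosen, len(covered))  # ['C', 'A'] 5
```

The deletion matters here: after item 3 is removed from A, the final subsets are A = {1, 2}, B = {3, 4}, C = {4, 5, 6}. A turnstile algorithm must give answers that reflect this final state, which is what makes the model harder than insertion-only streaming.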

Paper Structure

This paper contains 24 sections, 23 theorems, 11 equations, 3 figures, 6 algorithms.

Key Result

Theorem 1

Given an $n \times d$ matrix $\boldsymbol{A}$, an integer $k \geq 0$, and $\varepsilon \in (0,1)$, there exists a one-pass turnstile streaming algorithm using $\tilde{O}(d/\varepsilon^3)$ space and $\tilde{O}(1)$ update time that outputs a $(1-1/e - \varepsilon)$ relative approximation to maximum coverage.
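One way to read the guarantee in Theorem 1 is sketched below. It assumes that column $j$ of $\boldsymbol{A}$ encodes the $j$-th subset and that a nonzero entry $A_{ij}$ means item $i$ belongs to it; this encoding and the notation $\mathrm{Cov}(\cdot)$ are assumptions made for illustration and are not stated in the summary above.

$$
\mathrm{Cov}(S) = \bigl|\{\, i \in [n] : A_{ij} \neq 0 \text{ for some } j \in S \,\}\bigr|,
\qquad
\mathrm{OPT} = \max_{S \subseteq [d],\, |S| = k} \mathrm{Cov}(S).
$$

Under this reading, a $(1-1/e-\varepsilon)$ relative approximation means the algorithm outputs a set $\hat{S}$ of $k$ columns with

$$
\mathrm{Cov}(\hat{S}) \;\geq\; \left(1 - \frac{1}{e} - \varepsilon\right) \mathrm{OPT}.
$$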

Figures (3)

  • Figure 1: Targeted Fingerprinting Results for the "Adult" Dataset
  • Figure 2: Targeted Fingerprinting Results for the "US Census Data" Dataset: k vs. accuracy
  • Figure 3: General Fingerprinting Results for the "US Census Data" Dataset

Theorems & Definitions (39)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6: Theorems 2.7 and 3.1 of BEM2017almost
  • Lemma 7
  • proof
  • Lemma 8
  • proof
  • ...and 29 more