An Iconic Heavy Hitter Algorithm Made Private

Rayne Holland

TL;DR

The paper addresses heavy hitter detection in data streams under update-level differential privacy (DP). It introduces the first DP variant of SpaceSaving for the single-observation model, obtained by post-processing a non-private SpaceSaving summary with calibrated Laplace noise and a stability-based threshold, and by expanding the summary's capacity to preserve recall. It also presents a generic, memory-efficient wrapper that converts any DP frequency oracle into a label-private heavy hitter mechanism using O(k) extra space, enabling plug-and-play private heavy hitter extraction. Experiments on synthetic and real data show that DP SpaceSaving consistently outperforms the DP Misra-Gries baseline in utility while maintaining competitive memory and throughput, and that the wrapper effectively privatizes frequency-oracle outputs for heavy hitter recovery. Overall, the work shows that SpaceSaving's empirical advantages extend to the private setting and provides practical methods for private, scalable heavy hitter identification in streaming contexts.

Abstract

Identifying heavy hitters in data streams is a fundamental problem with widespread applications in modern analytics systems. These streams are often derived from sensitive user activity, making update-level privacy guarantees necessary. While recent work has adapted the classical heavy hitter algorithm Misra-Gries to satisfy differential privacy in the streaming model, the privatization of other heavy hitter algorithms with better empirical utility is absent. Under this observation, we present the first differentially private variant of the SpaceSaving algorithm, which, in the non-private setting, is regarded as the state-of-the-art in practice. Our construction post-processes a non-private SpaceSaving summary by injecting asymptotically optimal noise and applying a carefully calibrated selection rule that suppresses unstable labels. This yields strong privacy guarantees while preserving the empirical advantages of SpaceSaving. Second, we introduce a generic method for extracting heavy hitters from any differentially private frequency oracle in the data stream model. The method requires only O(k) additional memory, where k is the number of heavy items, and provides a mechanism for safely releasing item identities from noisy frequency estimates. This yields an efficient, plug-and-play approach for private heavy hitter recovery from linear sketches. Finally, we conduct an experimental evaluation on synthetic and real-world datasets. Across a wide range of privacy parameters and space budgets, our method provides superior utility to the existing differentially private Misra-Gries algorithm. Our results demonstrate that the empirical superiority of SpaceSaving survives privatization and that efficient, practical heavy hitter identification is achievable under strong differential privacy guarantees.
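
For readers unfamiliar with the summary being privatized, the following is a minimal sketch of the textbook, non-private SpaceSaving update rule. It is not the paper's private construction. The function name `space_saving` and the linear-time search for the minimum counter are illustrative assumptions; practical implementations typically keep counters in a structure that exposes the minimum cheaply.

```python
def space_saving(stream, k):
    """Non-private SpaceSaving summary holding at most k counters.

    Illustrative sketch only: the name and the dict-based layout are
    assumptions, not taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counters = {}
    for item in stream:
        if item in counters:
            # Monitored item: increment its counter.
            counters[item] += 1
        elif len(counters) < k:
            # Free slot: start monitoring the item.
            counters[item] = 1
        else:
            # Summary is full: evict the label with the smallest count
            # and hand its count, plus one, to the new item.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


if __name__ == "__main__":
    summary = space_saving("a a b a c a b d a".split(), k=2)
    print(summary)
```

Two properties of this update rule are worth noting: the counters always sum to the stream length, and every monitored count is an overestimate of the item's true frequency. The evicted label inherits nothing, which is why the set of stored labels can differ between neighboring streams and why releasing labels needs care in the private setting.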

Paper Structure

This paper contains 38 sections, 21 theorems, 52 equations, 12 figures, 1 table, 5 algorithms.

Key Result

Lemma 1

Let $f: \mathcal{X} \to \mathbb{R}^d$ be a function with $\ell_1$-sensitivity at most $1$, that is, for any pair of neighboring inputs $X, X' \in \mathcal{X}$, $\|f(X) - f(X')\|_1 \le 1$. Let $\varepsilon > 0$ and define the randomized mechanism $\mathcal{M}(X) = f(X) + (Z_1, \ldots, Z_d)$, where each $Z_i \sim \mathtt{Laplace}(1/\varepsilon)$ independently. Then $\mathcal{M}$ satisfies $(\varepsilon, 0)$-$\mathsf{DP}$.
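
The mechanism in Lemma 1 can be sketched in a few lines. This is a minimal illustration that assumes the caller has already computed a vector with $\ell_1$-sensitivity at most $1$. The function name `laplace_mechanism` and the `rng` parameter are hypothetical, and the sampler relies on the fact that the difference of two independent exponential variables with rate $\varepsilon$ follows a Laplace distribution with scale $1/\varepsilon$. It uses floating-point arithmetic, so it is not hardened against the known floating-point attacks on Laplace sampling.

```python
import random


def laplace_mechanism(values, epsilon, rng=None):
    """Add independent Laplace(1/epsilon) noise to each coordinate.

    Assumes `values` is f(X) for some f with l1-sensitivity at most 1.
    The name and signature are illustrative, not from the paper.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    noisy = []
    for v in values:
        # Exp(rate=eps) - Exp(rate=eps) is Laplace with scale 1/eps.
        z = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy.append(v + z)
    return noisy


if __name__ == "__main__":
    print(laplace_mechanism([12.0, 7.0, 0.0], epsilon=1.0))
```

Smaller values of `epsilon` give a larger noise scale and therefore stronger privacy at the cost of accuracy.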

Figures (12)

  • Figure 1: Example of transition to $(\mathsf{S1})$ when element $z \notin \mathcal{T}$ arrives from $X$ and $X^{\prime}$ provides no update.
  • Figure 2: Example of transition to $(\mathsf{S2})$ when element $z \notin \mathcal{T}$ arrives from $X$ and $X^{\prime}$ provides no update. Isolated elements are highlighted in red.
  • Figure 3: Example of a transition from $(\mathsf{S1})$ to $(\mathsf{S2})$ when an element $z \notin \mathcal{T} \cap \mathcal{T}^{\prime}$ arrives in the stream.
  • Figure 4: Example of a transition from $(\mathsf{S2})$ to $(\mathsf{S1})$ when an element $y \notin \mathcal{T} \cap \mathcal{T}^{\prime}$ arrives in the stream. Note that this transition also occurs if either $d \in \mathcal{T}\setminus \mathcal{T}^{\prime}$ or $c \in \mathcal{T}\setminus\mathcal{T}^{\prime}$ arrives in this position.
  • Figure 5: Example of a transition from $(\mathsf{S2})$ to $(\mathsf{S3})$ when an element $y \notin \mathcal{T} \cup \mathcal{T}^{\prime}$ arrives in the stream.
  • ...and 7 more figures

Theorems & Definitions (33)

  • Definition 1: $(\varepsilon, \delta)$-Differential Privacy
  • Lemma 1: Laplace Mechanism for Vector Valued Functions
  • Lemma 2: lebeda2023better
  • Lemma 3: Additive Error of the $\mathsf{SpaceSaving}$
  • Lemma 4: Approximation error for $\mathsf{CountMinSketch}$ cormode2005improved
  • Lemma 5: Approximation error for $\mathsf{CountSketch}$ charikar2002finding
  • Lemma 6
  • Corollary 1
  • Lemma 7
  • proof
  • ...and 23 more