An Iconic Heavy Hitter Algorithm Made Private
Rayne Holland
TL;DR
The paper addresses heavy hitter detection in data streams under update-level differential privacy (DP). It introduces the first DP variant of SpaceSaving for the single-observation model, obtained by post-processing a non-private SpaceSaving summary with calibrated Laplace noise and a stability-based threshold, and by expanding the summary's capacity to preserve recall. It also presents a generic, memory-efficient wrapper that converts any DP frequency oracle into a label-private heavy hitter mechanism using O(k) extra space, enabling plug-and-play private heavy hitter extraction. Experiments on synthetic and real data show that DP SpaceSaving consistently outperforms the DP Misra-Gries baseline in utility while maintaining competitive memory and throughput, and that the proposed wrapper effectively privatizes frequency-oracle outputs for heavy hitter recovery. Overall, the work shows that SpaceSaving's empirical advantages extend to the private setting and provides practical methods for private, scalable heavy hitter identification in streaming contexts.
Abstract
Identifying heavy hitters in data streams is a fundamental problem with widespread applications in modern analytics systems. These streams are often derived from sensitive user activity, making update-level privacy guarantees necessary. While recent work has adapted the classical heavy hitter algorithm Misra-Gries to satisfy differential privacy in the streaming model, other heavy hitter algorithms with better empirical utility have not yet been privatized. Motivated by this gap, we first present the first differentially private variant of the SpaceSaving algorithm, which, in the non-private setting, is regarded as the state of the art in practice. Our construction post-processes a non-private SpaceSaving summary by injecting asymptotically optimal noise and applying a carefully calibrated selection rule that suppresses unstable labels. This yields strong privacy guarantees while preserving the empirical advantages of SpaceSaving. Second, we introduce a generic method for extracting heavy hitters from any differentially private frequency oracle in the data stream model. The method requires only O(k) additional memory, where k is the number of heavy items, and provides a mechanism for safely releasing item identities from noisy frequency estimates. This yields an efficient, plug-and-play approach for private heavy hitter recovery from linear sketches. Finally, we conduct an experimental evaluation on synthetic and real-world datasets. Across a wide range of privacy parameters and space budgets, our method provides better utility than the existing differentially private Misra-Gries algorithm. Our results demonstrate that the empirical superiority of SpaceSaving survives privatization and that efficient, practical heavy hitter identification is achievable under strong differential privacy guarantees.
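The post-processing construction described above can be illustrated with a minimal Python sketch. It runs the standard non-private SpaceSaving update rule, then adds Laplace noise to each counter and releases only the labels whose noisy count clears a threshold. The function names (`space_saving`, `laplace`, `private_release`), the noise scale `1/epsilon`, and the threshold `1 + ln(1/delta)/epsilon` are illustrative assumptions; they are not the paper's calibration, and the sketch carries no privacy guarantee of its own.

```python
import math
import random


def space_saving(stream, capacity):
    """Non-private SpaceSaving: keep at most `capacity` (label, count) pairs."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict a minimum-count label; the newcomer inherits its count + 1.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_release(counters, epsilon, delta, rng=None):
    """Noise each counter and suppress labels below a threshold.

    ASSUMPTION: the scale and threshold below are placeholders chosen for
    illustration, not the calibrated values derived in the paper.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    threshold = 1.0 + math.log(1.0 / delta) / epsilon
    released = {}
    for label, count in counters.items():
        noisy = count + laplace(scale, rng)
        if noisy >= threshold:
            released[label] = noisy
    return released


if __name__ == "__main__":
    rng = random.Random(0)
    # Two heavy items plus 200 distinct light items, in random order.
    stream = ["a"] * 500 + ["b"] * 300 + [f"x{i}" for i in range(200)]
    rng.shuffle(stream)
    summary = space_saving(stream, capacity=10)
    print(private_release(summary, epsilon=1.0, delta=1e-6, rng=rng))
```

Any item occurring more than n/capacity times is guaranteed a slot in the non-private summary, so the two heavy labels survive the first stage here. Whether a label survives the second stage depends on the assumed threshold, which is where the expanded capacity mentioned in the TL;DR would matter for recall.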
