The SpaceSaving$\pm$ Family of Algorithms for Data Streams with Bounded Deletions
Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi, Claire Mathieu, Ahmed Metwally, Michel de Rougemont
TL;DR
This work extends the SpaceSaving$\pm$ framework to data streams with bounded deletions in which insertions and deletions are interleaved, closing a gap left by the original SpaceSaving$\pm$ algorithm, which requires non-interleaved streams. It introduces three algorithms (Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$), all operating in $O\left(\frac{\alpha}{\epsilon}\right)$ space and providing deterministic or unbiased frequency estimates and heavy hitters with provable residual error guarantees and, under mild skew assumptions, relative error guarantees. The authors also prove mergeability for distributed execution and present residual-based analyses under which the error bounds do not depend on the hot items. Empirical results on Zipfian and YCSB-like workloads show that Integrated SpaceSaving$\pm$ often delivers the best accuracy in frequency estimation, while all proposed methods outperform traditional linear sketches at identifying heavy hitters in large-scale, memory-constrained settings. Overall, the SpaceSaving$\pm$ family offers practical, scalable tools for real-time frequency analysis in streaming systems with deletions and interleaved updates.
Abstract
In this paper, we present an advanced analysis of near-optimal algorithms that use limited space to solve the frequency estimation, heavy hitters, frequent items, and top-k approximation problems in the bounded deletion model. We define the family of SpaceSaving$\pm$ algorithms and explain why the original SpaceSaving$\pm$ algorithm only works when insertions and deletions are not interleaved. Next, we propose the new Double SpaceSaving$\pm$, Unbiased Double SpaceSaving$\pm$, and Integrated SpaceSaving$\pm$ algorithms and prove their correctness. The three proposed algorithms represent different trade-offs: Double SpaceSaving$\pm$ can be extended to provide unbiased estimates, while Integrated SpaceSaving$\pm$ uses less space. Since data streams are often skewed, we present an improved analysis of these algorithms and show that their errors do not depend on the hot items. We also demonstrate how to achieve relative error guarantees under mild assumptions. Moreover, we establish that all three algorithms satisfy the important mergeability property, which is essential for running them in distributed settings.
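To make the setting concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the usual formulation of the bounded deletion model, in which a stream with $I$ insertions and $D$ deletions satisfies $D \le (1 - 1/\alpha) I$ for some $\alpha \ge 1$, and it assumes one simple way to handle interleaved updates: keep two insertion-only SpaceSaving summaries, one for insertions and one for deletions, and report their difference. The class names `SpaceSaving` and `DoubleSummary`, the counter budgets, and the unclipped difference estimator are all hypothetical choices made for illustration; the abstract does not specify these details.

```python
class SpaceSaving:
    """Classic insertion-only SpaceSaving summary with at most k counters."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # item -> (possibly overestimated) count

    def update(self, item, weight=1):
        if item in self.counts:
            self.counts[item] += weight
        elif len(self.counts) < self.k:
            self.counts[item] = weight
        else:
            # Summary is full: evict a minimum-count item and let the
            # newcomer inherit that count (the source of overestimation).
            victim = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(victim)
            self.counts[item] = min_count + weight

    def estimate(self, item):
        # Unmonitored items are reported as 0 in this sketch.
        return self.counts.get(item, 0)


class DoubleSummary:
    """Hypothetical two-summary scheme: track insertions and deletions
    separately, then estimate a frequency as their difference."""

    def __init__(self, k_ins, k_del):
        self.ins = SpaceSaving(k_ins)
        self.dels = SpaceSaving(k_del)

    def insert(self, item):
        self.ins.update(item)

    def delete(self, item):
        self.dels.update(item)

    def estimate(self, item):
        # Assumption: plain difference, left unclipped for simplicity.
        return self.ins.estimate(item) - self.dels.estimate(item)


# Interleaved insertions (+) and deletions (-) over three distinct items.
stream = [("+", "a"), ("+", "b"), ("-", "a"), ("+", "a"), ("+", "a"),
          ("+", "c"), ("-", "b"), ("+", "b"), ("+", "b"), ("+", "a"),
          ("-", "a"), ("+", "a")]

summary = DoubleSummary(k_ins=4, k_del=4)
for op, item in stream:
    if op == "+":
        summary.insert(item)
    else:
        summary.delete(item)

print({x: summary.estimate(x) for x in "abc"})
```

With four counters per side and only three distinct items, no eviction occurs, so the estimates in this toy run are exact. Once the number of distinct items exceeds the counter budget, each side can overestimate, and the accuracy of the difference then depends on how the two budgets are chosen relative to $\alpha$ and $\epsilon$.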
