Sampling Space-Saving Set Sketches

Homin K. Lee, Charles Masson

TL;DR

The paper tackles the heavy distinct hitters problem in distributed data streams, where labels with many distinct items must be identified under constant memory. It introduces Sampling Space-Saving Set Sketches (SSSS), a unification of Space-Saving with count-distinct sketches and input sampling, plus a practical HyperLogLog-based variant. The authors prove tight bounds on error and demonstrate that SSSS is invertible, mergeable, constant-memory, and fast to query, while achieving superior accuracy across diverse data sets. Empirical results show SSSS outperforms state-of-the-art sketches in memory efficiency, throughput, and accuracy, especially when merging parallel streams or handling large label sets. The work presents a practically viable solution for real-time monitoring in large-scale distributed systems.

Abstract

Large, distributed data streams are now ubiquitous. High-accuracy sketches with low memory overhead have become the de facto method for analyzing this data. For instance, if we wish to group data by some label and report the largest counts using fixed memory, we need to turn to mergeable heavy hitter sketches that can provide highly accurate approximate counts. Similarly, if we wish to keep track of the number of distinct items in a single set spread across several streams using fixed memory, we can turn to mergeable count distinct sketches that can provide highly accurate set cardinalities. If we were to try to keep track of the cardinality of multiple sets and report only on the largest ones, maintaining individual count distinct sketches for each set can grow unwieldy, especially if the number of sets is not known in advance. We consider the natural combination of the heavy hitters problem with the count distinct problem, the heavy distinct hitters problem: given a stream of $(\ell, x)$ pairs, find all the labels $\ell$ that are paired with a large number of distinct items $x$ using only constant memory. No previous work on heavy distinct hitters has managed to be of practical use in the large, distributed data stream setting. We propose a new algorithm, the Sampling Space-Saving Set Sketch, which combines sketching and sampling techniques and has all the desired properties for size, speed, accuracy, mergeability, and invertibility. We compare our algorithm to several existing solutions to the heavy distinct hitters problem, and provide experimental results across several data sets showing the superiority of the new sketch.
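To make the problem statement concrete, the following is a minimal exact baseline, not the Sampling Space-Saving Set Sketch itself. It keeps one full set per label, so its memory grows with the number of labels and items. That is exactly the cost the abstract says a constant-memory sketch must avoid. The function name, the toy stream, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict


def exact_heavy_distinct_hitters(stream, k):
    """Return the k labels paired with the most distinct items (exact counts).

    Illustrative baseline only: memory is unbounded, unlike a sketch.
    """
    items_by_label = defaultdict(set)
    for label, item in stream:
        # Repeated (label, item) pairs do not increase the distinct count.
        items_by_label[label].add(item)
    counts = {label: len(items) for label, items in items_by_label.items()}
    # Sort by distinct count (descending); break ties by label so the
    # result is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy stream of (label, item) pairs.
stream = [("a", 1), ("a", 2), ("a", 2), ("b", 1), ("c", 7), ("a", 3), ("b", 1)]
top = exact_heavy_distinct_hitters(stream, 2)
print(top)
```

In this baseline, merging two streams amounts to taking the union of the per-label sets. A mergeable sketch has to approximate that union in fixed memory.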

Paper Structure

This paper contains 17 sections, 4 theorems, 10 equations, 5 figures, 5 tables, 5 algorithms.

Key Result

lemma 1

After $m$ insertions, let $d_\ell$ be the number of distinct elements $x_i$ with label $\ell$. Let $\alpha := \min_{j\in S} S[j].\textsc{Distinct}()$, and suppose $\lvert\{\ell : d_\ell > 0\}\rvert > s$. Then, for Algorithm alg:osss, [...] where the last three properties hold with probability at least $1-\delta_c$ by the strong-tracking property of the count distinct sketch, and $\epsilon$ is its accuracy.
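The notation in the lemma can be made concrete with a small sketch. It only illustrates $d_\ell$, $\alpha$, and the condition $\lvert\{\ell : d_\ell > 0\}\rvert > s$. It does not implement the paper's algorithm. The class `ExactDistinct` is a hypothetical exact stand-in for a count distinct sketch, and the contents of the table `S` are chosen by hand rather than produced by any insertion procedure.

```python
from collections import defaultdict


class ExactDistinct:
    """Hypothetical exact stand-in for a count distinct sketch."""

    def __init__(self, items=()):
        self._items = set(items)

    def add(self, x):
        self._items.add(x)

    def distinct(self):
        return len(self._items)


# Hypothetical toy stream of (label, item) pairs.
stream = [("a", 1), ("a", 2), ("b", 1), ("c", 5), ("c", 5), ("a", 3)]

# d[label] plays the role of d_ell: the true number of distinct items per label.
seen = defaultdict(set)
for label, x in stream:
    seen[label].add(x)
d = {label: len(items) for label, items in seen.items()}

# A table with s entries; which labels it holds is an assumption here.
s = 2
S = {"a": ExactDistinct([1, 2, 3]), "c": ExactDistinct([5])}

# alpha is the smallest distinct count reported by any entry of the table.
alpha = min(entry.distinct() for entry in S.values())

# The lemma's condition: more labels with d_ell > 0 than table entries.
more_labels_than_slots = sum(1 for v in d.values() if v > 0) > s
```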

Figures (5)

  • Figure 1: A distributed web application, with each pod sending metrics to the monitoring system.
  • Figure 2: Accuracy vs Memory Usage. $\mathrm{NAE}(Q_{10})$, $\mathrm{NAE}(Q_{100})$, and $\mathrm{NAE}(Q_{1000})$, as defined in the paper's metrics subsection, for SSSS, Count-HLL, and SpreadSketch.
  • Figure 3: Top 1000 Query Duration (log scale)
  • Figure 4: Accuracy under merging (log scale)
  • Figure 5: Accuracy on synthetic sets with high overlap (log scale)

Theorems & Definitions (8)

  • lemma 1
  • proof
  • theorem 1
  • proof
  • lemma 2
  • proof
  • theorem 2
  • proof