Sampling Space-Saving Set Sketches

Homin K. Lee; Charles Masson

TL;DR

The paper tackles the heavy distinct hitters problem in distributed data streams, where labels with many distinct items must be identified under constant memory. It introduces Sampling Space-Saving Set Sketches (SSSS), a unification of Space-Saving with count-distinct sketches and input sampling, plus a practical HyperLogLog-based variant. The authors prove tight bounds on error and demonstrate that SSSS is invertible, mergeable, constant-memory, and fast to query, while achieving superior accuracy across diverse data sets. Empirical results show SSSS outperforms state-of-the-art sketches in memory efficiency, throughput, and accuracy, especially when merging parallel streams or handling large label sets. The work presents a practically viable solution for real-time monitoring in large-scale distributed systems.

Abstract

Large, distributed data streams are now ubiquitous. High-accuracy sketches with low memory overhead have become the de facto method for analyzing this data. For instance, if we wish to group data by some label and report the largest counts using fixed memory, we need to turn to mergeable heavy hitter sketches that can provide highly accurate approximate counts. Similarly, if we wish to keep track of the number of distinct items in a single set spread across several streams using fixed memory, we can turn to mergeable count distinct sketches that can provide highly accurate set cardinalities. If we were to try to keep track of the cardinality of multiple sets and report only on the largest ones, maintaining individual count distinct sketches for each set can grow unwieldy, especially if the number of sets is not known in advance. We consider the natural combination of the heavy hitters problem with the count distinct problem, the heavy distinct hitters problem: given a stream of $(\ell, x)$ pairs, find all the labels $\ell$ that are paired with a large number of distinct items $x$ using only constant memory. No previous work on heavy distinct hitters has managed to be of practical use in the large, distributed data stream setting. We propose a new algorithm, the Sampling Space-Saving Set Sketch, which combines sketching and sampling techniques and has all the desired properties for size, speed, accuracy, mergeability, and invertibility. We compare our algorithm to several existing solutions to the heavy distinct hitters problem, and provide experimental results across several data sets showing the superiority of the new sketch.
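To make the problem statement concrete, here is a minimal exact baseline in Python. It is not the paper's sketch: it keeps one full set per label, so its memory grows with the stream, which is precisely what a constant-memory sketch is meant to avoid. The function name `exact_heavy_distinct_hitters` and the sample stream are illustrative assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def exact_heavy_distinct_hitters(
    stream: Iterable[tuple[Hashable, Hashable]], k: int
) -> list[tuple[Hashable, int]]:
    """Return the k labels paired with the most distinct items (exact, unbounded memory)."""
    items_by_label: dict[Hashable, set] = defaultdict(set)
    for label, item in stream:
        # A repeated (label, item) pair does not change the distinct count.
        items_by_label[label].add(item)
    ranked = sorted(
        ((label, len(items)) for label, items in items_by_label.items()),
        # Largest distinct count first; ties broken by label for determinism.
        key=lambda pair: (-pair[1], str(pair[0])),
    )
    return ranked[:k]


# Hypothetical stream of (label, item) pairs.
stream = [("a", 1), ("a", 2), ("a", 2), ("b", 1),
          ("c", 7), ("c", 8), ("c", 9), ("b", 1)]
top = exact_heavy_distinct_hitters(stream, 2)
print(top)
```

A baseline like this is useful mainly as ground truth when measuring the error of an approximate sketch on small inputs.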

Paper Structure

This paper contains 17 sections, 4 theorems, 10 equations, 5 figures, 5 tables, 5 algorithms.

Introduction
Related Work
Count Distinct
Heavy Hitters
Heavy Distinct Hitters
Space-Saving Set Sketches
Sampling Space-Saving Set Sketches
Recycling the Count Distinct Sketches
Sampling the Input
Practical Implementation
Mergeability
Experiments
Metrics
Configuration
Data
...and 2 more sections

Key Result

lemma 1

After $m$ insertions, let $d_\ell$ be the number of distinct elements $x_i$ with label $\ell$. Let $\alpha := \min_{j\in S} S[j].\textsc{Distinct}()$, and let $\lvert\{\ell : d_\ell > 0\}\rvert > s$. Then, for Algorithm alg:osss: […] where the last three properties hold with probability at least $1-\delta_c$ by the strong-tracking property of the count distinct sketch, and $\epsilon$ is its accuracy.
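The notation in the lemma can be illustrated with a small Python toy. The class `ToySpaceSavingSets` is hypothetical and rests on two loudly stated assumptions: exact Python sets stand in for the count distinct sketches, and the eviction rule (a new label takes over the entry with the smallest distinct count) is borrowed by analogy from classic Space-Saving rather than taken from this summary. The toy only shows how $d_\ell$, $\alpha$, and the condition $\lvert\{\ell : d_\ell > 0\}\rvert > s$ are computed; it makes no claim about the lemma's bounds.

```python
from collections import defaultdict
from typing import Hashable


class ToySpaceSavingSets:
    """Toy table of at most s labels; exact sets stand in for count distinct sketches."""

    def __init__(self, s: int) -> None:
        self.s = s                          # maximum number of tracked labels
        self.S: dict[Hashable, set] = {}    # label -> stand-in "sketch"

    def distinct(self, label: Hashable) -> int:
        """Plays the role of S[label].Distinct() in the lemma."""
        return len(self.S[label])

    def alpha(self) -> int:
        """alpha := minimum distinct count over the tracked labels."""
        return min(self.distinct(label) for label in self.S)

    def insert(self, label: Hashable, item: Hashable) -> None:
        if label in self.S:
            self.S[label].add(item)
        elif len(self.S) < self.s:
            self.S[label] = {item}
        else:
            # ASSUMED rule (Space-Saving analogy): the new label inherits
            # the entry of the label with the smallest distinct count.
            victim = min(self.S, key=self.distinct)
            inherited = self.S.pop(victim)
            inherited.add(item)
            self.S[label] = inherited


# Hypothetical stream; exact per-label sets give the true d_l for comparison.
stream = [("a", 1), ("a", 2), ("b", 1), ("c", 5), ("c", 6), ("a", 3)]
sketch = ToySpaceSavingSets(s=2)
exact = defaultdict(set)
for label, item in stream:
    sketch.insert(label, item)
    exact[label].add(item)

d = {label: len(items) for label, items in exact.items()}
alpha = sketch.alpha()
more_labels_than_slots = sum(1 for v in d.values() if v > 0) > sketch.s
print(d, alpha, more_labels_than_slots)
```

In this run the stream has three labels but only two slots, so the lemma's precondition holds and one label is evicted along the way.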

Figures (5)

Figure 1: A distributed web application, with each pod sending metrics to the monitoring system.
Figure 2: Accuracy vs Memory Usage. $NAE(Q_{10}), NAE(Q_{100}), NAE(Q_{1000})$ as defined in the Metrics section for SSSS, Count-HLL, and SpreadSketch.
Figure 3: Top 1000 Query Duration (log scale)
Figure 4: Accuracy under merging (log scale)
Figure 5: Accuracy on synthetic sets with high overlap (log scale)

Theorems & Definitions (8)

lemma 1
proof
theorem 1
proof
lemma 2
proof
theorem 2
proof
