Sublime: Sublinear Error & Space for Unbounded Skewed Streams

Navid Eslami, Ioana O. Bercea, Rasmus Pagh, Niv Dayan

Abstract

Modern stream processing systems must often track the frequency of distinct keys in a data stream in real time. Since monitoring the exact counts often entails a prohibitive memory footprint, many applications rely on compact, probabilistic data structures called frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical aspects: (1) They are memory-inefficient under data skew. This is because they use uniformly sized counters to track the key counts and thus waste memory on storing the leading zeros of many small counter values. (2) Their estimation error deteriorates at least linearly with the stream's length, which may grow indefinitely over time. This is because they count the keys using a fixed number of counters. We present Sublime, a framework that generalizes frequency estimation sketches to address these problems by dynamically adapting to the stream's skew and length. To save memory under skew, Sublime uses short counters upfront and elongates them with extensions stored within the same cache line as they overflow. It leverages novel bit manipulation routines to quickly access a counter's extension. It also controls the scaling of its error rate by expanding its number of approximate counters as the stream grows. We apply Sublime to Count-Min Sketch and Count Sketch. We show, theoretically and empirically, that Sublime significantly improves accuracy and memory over the state of the art while maintaining competitive or superior performance.
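
As a point of reference for limitation (2), the following is a minimal sketch of a conventional fixed-size Count-Min Sketch, not of Sublime itself. The class name, the salted BLAKE2b hash family, and the parameters `width` and `depth` are illustrative assumptions and are not taken from the paper. Because the number of counters is fixed, each row's counters always sum to the stream length $N$, so collisions and the resulting overestimates grow with the stream.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min Sketch: `depth` rows of `width` counters (illustrative)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # independent hash functions a real implementation would use.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def insert(self, key, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key):
        # The minimum over rows never underestimates the true count.
        return min(
            self.rows[row][self._index(key, row)] for row in range(self.depth)
        )
```

A caller would create `CountMinSketch(width=1024, depth=4)`, call `insert(key)` once per stream item, and read frequencies with `estimate(key)`. The sizes here are arbitrary.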

Paper Structure (19 sections, 17 theorems, 13 equations, 17 figures, 4 tables)

Key Result

Theorem 4.1

When using a size function $W(\cdot)$ and a single array on a stream with a total key count of $N$, SublimeCMS provides an estimation error $E(N)$ satisfying [bound omitted in source]. Compared to a fixed-size CMS allocated upfront with an array of $W(N)$ counters with knowledge of $N$, the expected error terms in that bound are higher by at most a constant and a logarithmic factor.

Figures (17)

  • Figure 1: CMS maintains $d$ counter arrays of size $w$ and hashes keys into them to estimate their frequencies. Insertions are illustrated on the left and queries on the right.
  • Figure 2: Counter sharing and counter merging encode CMS's counters in less space in exchange for blowing up some counter values. Here, the bits in the binary representations are in increasing order of significance from left to right.
  • Figure 3: Under the same memory budget as a CMS instance with 32-bit counters, counter sharing and counter merging can lead to lower accuracy when processing a growing stream.
  • Figure 3: Tuning VALE's parameters for each plot point to the left according to the table above enables high performance and accuracy for SublimeCMS.
  • Figure 4: VALE encodes a chunk of $c$ counters in a cache line. It encodes the $s$ lower-order bits of each counter in a stub and stores its remaining higher-order bits in extensions comprised of 2-bit fragments representing base-3 digits. Here, we use a stub length of $s=6$.
  • ...and 12 more figures
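
The arithmetic behind the Figure 4 caption can be illustrated with a small sketch: a counter value is split into an $s$-bit stub and the base-3 digits of its remaining higher-order part, with each digit occupying one 2-bit fragment. The function names, the least-significant-first digit order, and the bit-cost helper are assumptions made for illustration. The caption does not specify how fragments are laid out or located within the cache line, so that part is not modeled here.

```python
def split_counter(value, stub_bits=6):
    """Split a counter into (stub, base-3 digits of the higher-order part).

    Digits are returned least significant first (an assumed ordering).
    """
    if value < 0:
        raise ValueError("counter values are non-negative")
    stub = value & ((1 << stub_bits) - 1)   # the s lower-order bits
    high = value >> stub_bits               # the remaining higher-order part
    digits = []
    while high > 0:
        digits.append(high % 3)             # each digit fits in one 2-bit fragment
        high //= 3
    return stub, digits


def join_counter(stub, digits, stub_bits=6):
    """Inverse of split_counter: rebuild the counter value."""
    high = 0
    for digit in reversed(digits):
        high = high * 3 + digit
    return (high << stub_bits) | stub


def encoded_bits(value, stub_bits=6):
    """Bits used under this simplified model: the stub plus 2 bits per digit."""
    _, digits = split_counter(value, stub_bits)
    return stub_bits + 2 * len(digits)
```

Under this model, a value below $2^s$ needs only its stub, while a larger value pays 2 extra bits per base-3 digit. With $s=6$, the value 1000 splits into stub 40 and digits `[0, 2, 1]`, for 12 bits in total.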

Theorems & Definitions (18)

  • Theorem 4.1
  • Theorem 4.2
  • Lemma 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • Theorem A.1
  • Theorem A.2
  • Lemma B.1
  • Theorem B.1
  • ...and 8 more