Learning-Based Heavy Hitters and Flow Frequency Estimation in Streams

Rana Shahout, Michael Mitzenmacher

TL;DR

The paper tackles the challenge of identifying heavy hitters and estimating flow frequencies in streams under tight memory. It introduces Learned Space Saving (LSS), the first learned competing-counter-based approach, which augments the Space Saving algorithm with two predictors: one for low-frequency items (LSS-LF) and one for heavy hitters (LSS-HH), plus a higher-throughput variant, LSS+. LSS employs a Counting Bloom Filter and a fixed-vs-mutable counter split to maintain robustness against prediction errors, yielding substantial gains in top-k precision, heavy-hitter recall, and RMSE for frequency estimation across synthetic, CAIDA IP, and AOL Web data. The framework is supported by theoretical robustness guarantees and extensive experiments showing that LSS can outperform Space Saving under realistic conditions and configurations. The work provides practical guidance for deploying learning-augmented frequency-estimation schemes in high-speed networks and similar streaming contexts, with implications for memory-efficient measurement and detection tasks.

Abstract

Identifying heavy hitters and estimating the frequencies of flows are fundamental tasks in various network domains. Existing approaches to this challenge can broadly be categorized into two groups: hashing-based and competing-counter-based. The Count-Min sketch is a standard example of a hashing-based algorithm, and the Space Saving algorithm is an example of a competing-counter algorithm. Recent works have explored the use of machine learning to enhance algorithms for frequency estimation problems, under the algorithms-with-predictions framework. However, these works have focused solely on the hashing-based approach, which may not be best for identifying heavy hitters. In this paper, we present the first learned competing-counter-based algorithm, called LSS, for identifying heavy hitters, finding the top-k items, and estimating flow frequencies, building on the well-known Space Saving algorithm. We provide theoretical insights into how and to what extent our approach can improve upon Space Saving, backed by experimental results on both synthetic and real-world datasets. Our evaluation demonstrates that LSS can enhance the accuracy and efficiency of Space Saving in identifying heavy hitters, finding the top-k items, and estimating flow frequencies.

Paper Structure

This paper contains 27 sections, 9 theorems, 4 equations, 8 figures, 1 table, and 2 algorithms.

Key Result

Lemma 1

Space Saving with $k=\epsilon^{-1}$ counters ensures that after processing $N$ insertions, the minimum count over all monitored items is at most $\frac{N}{k} = N\epsilon$, i.e., $minCount \le N\epsilon$.
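The bound in Lemma 1 follows from the fact that each insertion increases the total of all counters by exactly one, so the minimum of $k$ counters can never exceed $N/k$. A minimal sketch of Space Saving illustrating this (class and method names are illustrative, not from the paper):

```python
class SpaceSaving:
    """Minimal Space Saving sketch: k competing counters."""

    def __init__(self, k):
        self.k = k          # number of counters (k = 1/epsilon)
        self.counts = {}    # monitored item -> estimated count

    def insert(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the minimum-count item; the newcomer inherits
            # min_count + 1 (the competing-counter step).
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def min_count(self):
        return min(self.counts.values()) if self.counts else 0


stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 5  # N = 120
ss = SpaceSaving(k=4)
for x in stream:
    ss.insert(x)
# Lemma 1: the minimum monitored count is at most N/k = 120/4 = 30.
assert ss.min_count() <= len(stream) / ss.k
```

Because every insertion adds exactly one to the counter total (either an increment or an eviction followed by `min_count + 1`), the invariant holds for any input stream.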

Figures (8)

  • Figure 1: In many practical distributions (such as Zipfian, shown here using logarithmic scales on both axes), there are many low-frequency items.
  • Figure 2: Overview of LSS, combining the approaches of LSS-LF and LSS-HH. LSS-LF utilizes a low-frequency predictor to exclude infrequent items. LSS-HH divides entries into fixed and mutable entries, using a heavy hitters predictor to allocate fixed entries for frequent items while processing remaining items through the mutable entries. To mitigate the impact of prediction errors, LSS-LF utilizes a Counting Bloom Filter (CBF), while LSS-HH sets a limit ($k_{hh}$) on the number of fixed entries (represented in green color).
  • Figure 3: An illustration of our algorithms. Consider an input stream $S$ of 11 items as shown, where $A$ is predicted as a heavy hitter, the first arrival of $B$ is predicted (incorrectly) as low frequency, and $D, E, F, G, H, I, J, K$ are predicted (correctly) as low-frequency items. In this example, low-frequency items take over the SS counters. In LSS-HH, by allocating a fixed entry for the predicted heavy hitter $A$, the remaining counters are again taken over by low-frequency items. LSS-LF, however, ensures that low-frequency items do not dominate by filtering them. When $B$ arrives after being previously stored in the filter, it is tracked again. Note that LSS-LF uses fewer counters than SS and LSS-HH to account for the additional memory required for the filter.
  • Figure 4: Histogram of predicted frequencies (up to $50$) for the datasets used (log scale). For the Web search and IP datasets, we use the learned model described in the paper's learned-predictor section; for the Zipf distribution, we use the simulated predictor with $p=0.9$.
  • Figure 5: (a-c) Robustness of LSS using web search dataset (a) precision of top-k ($k=10$) with all predictions as 1 (b) recall of finding heavy hitters when all predictions are heavy hitters (c) precision of top-k vs. prediction accuracy $p$. (d-e) Impact of fixed counters on top-k ($k=64$) and heavy hitters using web search dataset.
  • ...and 3 more figures
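The gating mechanism described in the captions of Figures 2 and 3 can be sketched as follows. This is a hypothetical, self-contained illustration of the LSS-LF idea, not the paper's implementation: a plain dictionary stands in for a real Counting Bloom Filter, and the promotion rule (track an item once it reappears after being filtered) is a simplified reading of the "$B$ is tracked again" behavior in Figure 3.

```python
class _SpaceSaving:
    """Compact Space Saving table used as the backing structure."""

    def __init__(self, k):
        self.k, self.counts = k, {}

    def insert(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1


class LSSLF:
    """Sketch of LSS-LF gating: predicted-low items go to a filter first."""

    def __init__(self, k, predict_low):
        self.ss = _SpaceSaving(k)
        self.cbf = {}                    # dict stand-in for a Counting Bloom Filter
        self.predict_low = predict_low   # item -> True if predicted low-frequency

    def insert(self, item):
        if item in self.ss.counts or not self.predict_low(item):
            self.ss.insert(item)         # already tracked, or predicted frequent
            return
        self.cbf[item] = self.cbf.get(item, 0) + 1
        if self.cbf[item] > 1:           # repeat arrival: the "low" prediction
            self.ss.insert(item)         # looks wrong, so start tracking it


# Everything except "a" is (sometimes wrongly) predicted low-frequency.
lf = LSSLF(k=2, predict_low=lambda x: x != "a")
for x in ["a", "b", "b", "b", "c", "d"]:
    lf.insert(x)
# "b" was mispredicted as low-frequency, but its repeat arrivals promote
# it out of the filter; true one-off items "c" and "d" never occupy counters.
```

This shows why the filter limits the damage of prediction errors: a wrongly filtered frequent item loses at most its first arrival, while genuinely rare items are kept out of the counter table entirely.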

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • Proof
  • Theorem 2
  • Proof
  • Theorem 3
  • ...and 7 more