Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen

TL;DR

This work tackles accurate per-key frequency estimation in high-speed data streams under strict memory budgets. It introduces UCL-sketch, a GT-free, online-learning sketch that recovers frequencies directly from sketch counters using an equivalent-learning-based solver and a scalable, bucketed architecture with shared parameters. Theoretical guarantees and extensive experiments show near-oracle accuracy under tight memory and substantial speedups over traditional equation-based solvers, supporting its practical viability for real-time analytics. The approach blends compressive sensing with self-supervised learning and a scalable data/control-plane design, enabling rapid adaptation to changing distributions and large key spaces. The code is publicly available for further research.

Abstract

Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data streams. UCL-sketch, which uses compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than that of prior works, without sacrificing processing speed. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To help further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.

Paper Structure

This paper contains 37 sections, 39 equations, 25 figures, 11 tables, 1 algorithm.

Figures (25)

  • Figure 1: Comparison between the previous learning-augmented sketches and our studied learning-based sketch: In our approach, we empower the sketch with learning technologies in the recovery phase to improve streaming throughput. The model is trained online using only the compressed counters in the sketch, which is much more practical and efficient than the prior works.
  • Figure 2: The overall processing framework of an equation-based sketch: In the data plane, it builds a local sketch to record the data stream and a key tracking mechanism for new item identification and reporting. After the centralized server receives sketch counters and keys from the monitor device, the control plane can recover the frequencies by solving an under-constrained equation system.
  • Figure 3: The basic design of UCL-sketch: (a) UCL-sketch maintains a hash table, a CM-sketch, and a Bloom filter in the data plane, while deploying two key sets in the control plane (storing keys evicted from the hash table and other keys, respectively), along with a learning-based solver. (b) During stream processing in the data plane, each incoming item is first checked against the hash table; if the insertion fails, the item is inserted into the CM-sketch instead, and its key is reported to the control plane if it was either recently evicted from the hash table or not found in the Bloom filter.
  • Figure 4: Main ideas of training strategy: (a) The unbounded set of frequency vectors is tolerant of certain Zipfian transformations. (b) Continually adapting the model using sampled counters in a sliding window for online training. (c) The learned solver $y \rightarrow x$ should also be invariant to these natural transformations. (d) Illustration of our self-supervised equivalent loss design.
  • Figure 5: Expansion design of a learned solver: Left: Redesigning. Whenever a new key emerges, the entire output layer learned on previous streams is replaced and retrained. Center: Partial-redesigning with buckets. The keys are separated into independent buckets, so the solver is affected only by the newly updated bucket. Right: Non-redesigning with logical buckets. By sharing buckets, the solver can be adapted to a varying key space, and the number of parameters is greatly reduced.
  • ...and 20 more figures