Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen
TL;DR
This work tackles accurate per-key frequency estimation in high-speed data streams under strict memory constraints. It introduces UCL-sketch, a ground-truth-free (GT-free), online-learning sketch that recovers frequencies directly from sketch counters using an equivalent-learning-based solver and a scalable, bucketed architecture with shared parameters. Theoretical guarantees and extensive experiments show near-oracle accuracy under tight memory budgets and substantial decoding speedups over traditional equation-based solvers, supporting its practical viability for real-time analytics. The approach combines compressive sensing with self-supervised learning and a scalable data/control-plane design, enabling rapid adaptation to changing distributions and large key spaces. The code is publicly available for further research.
Abstract
Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, which limits their suitability for real-time processing, while offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture that leverages logically structured estimation buckets to scale to real-world data streams. UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than those of prior works, without sacrificing processing speed. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches in terms of per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared with the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To support further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.
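For readers unfamiliar with the setting, the following minimal sketch illustrates what a traditional sketch looks like and why its estimates are coarse under tight memory. It uses a Count-Min sketch, which is our choice of illustrative baseline and is not named in the abstract; it is not the UCL-sketch itself. The class name `CountMinSketch`, the helper `demo`, the salted-hash scheme, and all sizes are assumptions made for this example. Each counter is a sum of the frequencies of the keys hashed to it, so the counters are linear measurements of the frequency vector, which is the kind of structure a compressive-sensing recovery can exploit.

```python
import random
from collections import Counter


class CountMinSketch:
    """Illustrative Count-Min sketch (an assumed baseline, not UCL-sketch)."""

    def __init__(self, depth=3, width=64, seed=0):
        rng = random.Random(seed)
        self.depth = depth
        self.width = width
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Map a key to one counter in the given row.
        return hash((self.salts[row], key)) % self.width

    def update(self, key, count=1):
        # Every arrival increments exactly one counter per row.
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def query(self, key):
        # Collisions only add mass, so the minimum over rows never underestimates.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


def demo(num_keys=500, stream_len=20000, seed=1):
    """Feed a skewed synthetic stream into the sketch and return the results."""
    rng = random.Random(seed)
    keys = list(range(num_keys))
    weights = [1.0 / (k + 1) for k in keys]  # heavy-tailed key popularity
    stream = rng.choices(keys, weights=weights, k=stream_len)
    truth = Counter(stream)
    cms = CountMinSketch(depth=3, width=64)
    for key in stream:
        cms.update(key)
    estimates = {key: cms.query(key) for key in truth}
    return truth, estimates, cms


if __name__ == "__main__":
    truth, estimates, cms = demo()
    total_overestimate = sum(estimates[k] - truth[k] for k in truth)
    print("distinct keys:", len(truth))
    print("average overestimate per key:", total_overestimate / len(truth))
```

With far more distinct keys than counters per row, most rare keys share counters with popular ones, so their estimates are inflated. Recovering per-key frequencies from the counters as a whole, rather than reading each key's counters in isolation, is the gap that the approach described above targets.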
