Carbonyl4: A Sketch for Set-Increment Mixed Updates
Yikai Zhao, Yuhan Wu, Tong Yang
TL;DR
This work introduces Carbonyl4, a novel sketch designed for Set-Increment Mixed (SIM) data streams, enabling accurate, unbiased per-key estimates under tight memory. It combines Balance Bucket for local variance optimization with Cascading Overflow to extend precision across the bucket array, achieving near-global variance reduction. To address memory constraints, it offers in-place shrinking via re-sampling and heuristic approaches, maintaining high accuracy for point, subset, and Top-K queries. Extensive experiments on CAIDA, synthetic, Webpage, and Criteo datasets show substantial improvements in MSE, AAE, and recall compared to existing sketches, underscoring Carbonyl4’s practical value for memory-adaptive data-stream analytics.
Abstract
SET-INCREMENT Mixed (SIM) data streams call for algorithms that handle both SET and INCREMENT operations efficiently. We present Carbonyl4, an algorithm designed specifically for SIM data streams that provides accurate, unbiased estimates and adapts to changing memory budgets. Carbonyl4 introduces two techniques: the Balance Bucket, which refines variance optimization within a bucket, and the Cascading Overflow, which maintains precision in overflow scenarios. Experiments on four diverse datasets show that Carbonyl4 outperforms existing algorithms, particularly in accuracy for item-level information retrieval and in adaptability to fluctuating memory requirements. Carbonyl4 also supports dynamic memory shrinking, achieved through either a re-sampling approach or a heuristic approach. The source code of Carbonyl4 is available on GitHub.
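To make the SET and INCREMENT semantics concrete, the following minimal sketch computes the exact per-key values of a small SIM stream. It is not Carbonyl4 itself, only the memory-unbounded ground truth that a sketch would approximate. The `(key, op, value)` tuple encoding, the `"SET"`/`"INC"` labels, and the function name `apply_sim_stream` are illustrative assumptions, not taken from the paper or its code.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical encoding of one SIM update: (key, op, value),
# where op is "SET" (overwrite) or "INC" (add to the current value).
Update = Tuple[str, str, int]


def apply_sim_stream(updates: Iterable[Update]) -> Dict[str, int]:
    """Return exact per-key values after applying all updates in order."""
    state: Dict[str, int] = {}
    for key, op, value in updates:
        if op == "SET":
            # SET discards the previous value of the key.
            state[key] = value
        elif op == "INC":
            # INCREMENT adds to the previous value (0 if the key is unseen).
            state[key] = state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return state


stream = [
    ("a", "INC", 3),
    ("b", "SET", 10),
    ("a", "INC", 2),
    ("b", "INC", -4),
    ("a", "SET", 1),
]
truth = apply_sim_stream(stream)
print(truth)  # {'a': 1, 'b': 6}
```

An exact dictionary like this grows with the number of distinct keys, which is the cost a fixed-memory sketch avoids in exchange for estimation error.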
