Carbonyl4: A Sketch for Set-Increment Mixed Updates

Yikai Zhao, Yuhan Wu, Tong Yang

TL;DR

This work introduces Carbonyl4, a novel sketch designed for Set-Increment Mixed (SIM) data streams, enabling accurate, unbiased per-key estimates under tight memory. It combines Balance Bucket for local variance optimization with Cascading Overflow to extend precision across the bucket array, achieving near-global variance reduction. To address memory constraints, it offers in-place shrinking via re-sampling and heuristic approaches, maintaining high accuracy for point, subset, and Top-K queries. Extensive experiments on CAIDA, synthetic, Webpage, and Criteo datasets show substantial improvements in MSE, AAE, and recall compared to existing sketches, underscoring Carbonyl4’s practical value for memory-adaptive data-stream analytics.

Abstract

In data stream processing, the advent of SET-INCREMENT Mixed (SIM) data streams calls for algorithms that efficiently handle both SET and INCREMENT operations. We present Carbonyl4, an algorithm designed specifically for SIM data streams that ensures accuracy, unbiasedness, and adaptability. Carbonyl4 introduces two techniques: the Balance Bucket, for refined variance optimization, and the Cascading Overflow, for maintaining precision when buckets overflow. Our experiments across four diverse datasets show that Carbonyl4 outperforms existing algorithms, particularly in accuracy for item-level information retrieval and in adaptability to fluctuating memory requirements. The versatility of Carbonyl4 is further demonstrated by its dynamic memory shrinking capability, achieved via a re-sampling approach and a heuristic approach. The source code of Carbonyl4 is available on GitHub.

Paper Structure

This paper contains 22 sections, 6 theorems, 15 equations, 17 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

The operator $\texttt{MERGE}^{\pm}(\langle e_1, v_1\rangle, \langle e_2, v_2\rangle)$ offers an unbiased estimate for the entries $\langle e_1, v_1\rangle$ and $\langle e_2, v_2\rangle$. The variance of this operator, termed the merge cost, is $2|v_1||v_2|$, which is proven to be optimal among all u
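As an illustration of how a single merged entry can stay unbiased for both keys, here is a minimal sketch of one randomized merge rule whose total variance equals the stated merge cost $2|v_1||v_2|$. The rule itself is an assumption made for illustration, not a definition taken from the text above: the survivor is chosen with probability proportional to $|v_i|$, and it stores $|v_1|+|v_2|$ with its own sign. The names `merge_pm` and `merge_moments` are hypothetical.

```python
import math
import random


def merge_pm(e1, v1, e2, v2, rng=random):
    """Merge two entries into one (assumed rule, for illustration only).

    The surviving key is e1 with probability |v1| / (|v1| + |v2|), else e2.
    The survivor stores |v1| + |v2| carrying its own original sign; the
    dropped key is then estimated as 0.
    """
    total = abs(v1) + abs(v2)
    if total == 0:
        return e1, 0.0
    if rng.random() < abs(v1) / total:
        return e1, math.copysign(total, v1)
    return e2, math.copysign(total, v2)


def merge_moments(v1, v2):
    """Exact (mean, variance) of the estimate for each key under merge_pm.

    Each key's estimate takes only two values (the stored value or 0),
    so the moments can be computed in closed form.
    """
    total = abs(v1) + abs(v2)
    if total == 0:
        return [(0.0, 0.0), (0.0, 0.0)]
    p1 = abs(v1) / total
    moments = []
    for v, p in ((v1, p1), (v2, 1.0 - p1)):
        kept = math.copysign(total, v)   # value stored if this key survives
        mean = p * kept                  # equals v, so the estimate is unbiased
        var = p * kept ** 2 - mean ** 2  # equals |v1| * |v2|
        moments.append((mean, var))
    return moments


if __name__ == "__main__":
    (m1, var1), (m2, var2) = merge_moments(7.8, -3.6)
    print("means:", m1, m2)
    print("total variance:", var1 + var2, "vs 2|v1||v2| =", 2 * 7.8 * 3.6)
```

Under this assumed rule, each key's expected estimate equals its true value and each key contributes variance $|v_1||v_2|$, so the two contributions sum to $2|v_1||v_2|$. Whether this matches the paper's exact operator definition is not confirmed by the truncated statement above.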

Figures (17)

  • Figure 1: Balance Bucket. A bucket initially containing $d=4$ entries $\langle e_1,v_1\rangle,\cdots,\langle e_4,v_4\rangle$; the right side shows the four possible update scenarios when a new update $\langle e,v\rangle$ arrives: two possibilities when $|v|\leqslant|v_3|$, and two possibilities when $|v|>|v_3|$.
  • Figure 2: A Cascading Overflow example. (a) Searching Stage: The process starts with the update $\langle e, 7.8\rangle$ at bucket $B_1$, with $min_{global}$ initially infinite. Step 1: $min_{local} = 28.08$ prompts an update to $min_{global}$, and the search moves to $B_4$ with $\langle e_2, -3.6\rangle$. Step 2: $min_{global}$ becomes $13.68$, and the search moves to $B_8$ with $\langle e_8, 3.8\rangle$. Step 3: With $min_{local} \geqslant min_{global}$, the search proceeds to $B_6$ with a chance of stopping. Step 4: A new low for $min_{global}$ at $0.123$ leads to $B_3$ with $\langle e_{12}, -0.03\rangle$. Step 5: The search may stop, with $B_{opt}$ identified as $B_{6}$. (b) Kicking Stage: Starting at $B_1$, the entry $\langle e, 7.8\rangle$ causes a series of displacements across the buckets, ending with a merge in $B_6$.
  • Figure 3: Re-Sampling shrinking. Initially, there are a total of 8 entries (in blue) from two buckets, and after shrinking, there remain 4 entries (in orange) placed in one bucket.
  • Figure 4: Heuristic shrinking. Initially, there are a total of 8 entries (in blue) from two buckets, and after shrinking, there remain 4 entries (in orange) placed in one bucket.
  • Figure 5: Illustration of Time Complexity.
  • ...and 12 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Definition 4