
QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams

Yiyan Qi, Rundong Li, Pinghui Wang, Yufang Sun, Rui Xing

TL;DR

This paper addresses weighted cardinality estimation over data streams, a problem for which few scalable solutions exist. It introduces QSketch, a quantization-based sketch that compresses register values to small integers (5–8 bits) and updates in near-constant time, together with an MLE-based estimator for the weighted cardinality $C^{(t)}$ and a dynamic variant, QSketch-Dyn, for real-time tracking. On synthetic and real datasets, QSketch achieves accuracy comparable to state-of-the-art methods at about 1/8 of the memory, and QSketch-Dyn delivers the best accuracy and throughput, including substantial improvements on large-scale streams such as CAIDA. The authors note extensions to deletions and negative weights as potential future work.

Abstract

Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element carries a positive weight. Unlike traditional cardinality estimation, limited research exists on weighted cardinality, with current methods requiring substantial memory and computational resources, challenging for devices with limited capabilities and real-time applications like anomaly detection. To address these issues, we propose QSketch, a memory-efficient sketch method for estimating weighted cardinality in streams. QSketch uses a quantization technique to condense continuous variables into a compact set of integer variables, with each variable requiring only 8 bits, making it 8 times smaller than previous methods. Furthermore, we leverage dynamic properties during QSketch generation to significantly enhance estimation accuracy and achieve a lower time complexity of $O(1)$ for updating estimations upon encountering a new element. Experimental results on synthetic and real-world datasets show that QSketch is approximately 30% more accurate and two orders of magnitude faster than the state-of-the-art, using only $1/8$ of the memory.
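
The quantization idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's algorithm: the class name `QuantizedSketch`, the helper `_uniform`, the hash-derived exponential variables, the base-2 quantization rule, and the bisection-based maximum-likelihood solver are all hypothetical. In particular, the update below touches every register, so it costs O(m) per element and does not reproduce the fast update described in the abstract.

```python
import hashlib
import math
from array import array


def _uniform(key, j):
    """Hash (register index, key) to a float strictly inside (0, 1)."""
    digest = hashlib.blake2b(f"{j}:{key}".encode(), digest_size=8).digest()
    n = int.from_bytes(digest, "big")
    return ((n >> 12) + 0.5) / 2**52  # 52 hash bits, never exactly 0 or 1


class QuantizedSketch:
    """Hypothetical illustration; not the paper's reference implementation."""

    def __init__(self, m=256, bits=8):
        # Assumed signed register range; array("b") requires bits <= 8.
        self.m = m
        self.r_min = -(2 ** (bits - 1)) + 1
        self.r_max = 2 ** (bits - 1) - 1
        self.regs = array("b", [self.r_min] * m)  # one signed byte per register

    def add(self, key, weight):
        """Naive O(m) update: every register sees every element."""
        for j in range(self.m):
            r = -math.log(_uniform(key, j)) / weight  # Exp(weight) variable
            y = math.floor(-math.log2(r))             # quantize to an integer
            y = max(self.r_min, min(self.r_max, y))   # clip to the register range
            if y > self.regs[j]:
                self.regs[j] = y

    def estimate(self):
        """Maximum-likelihood estimate of the total weight of distinct keys.

        Assumes no register is clipped at r_min or r_max.
        """
        if all(r == self.r_min for r in self.regs):
            return 0.0  # nothing has been added yet
        a = [2.0 ** -(r + 1) for r in self.regs]

        def score(c):
            # Derivative of the log-likelihood with respect to c.
            total = 0.0
            for aj in a:
                x = c * aj
                inv = 1.0 / math.expm1(x) if x < 50 else 0.0
                total += aj * (inv - 1.0)
            return total

        # Each term of the score changes sign at ln(2) * 2^(r + 1),
        # so the root lies between the smallest and largest such value.
        lo = math.log(2) * 2.0 ** (min(self.regs) + 1)
        hi = math.log(2) * 2.0 ** (max(self.regs) + 1)
        for _ in range(100):
            mid = math.sqrt(lo * hi)  # bisection on a log scale
            if score(mid) > 0:
                lo = mid
            else:
                hi = mid
        return math.sqrt(lo * hi)
```

Because each random variable is derived from a hash of the key, adding the same element again leaves the registers unchanged, which is what makes the sketch count distinct elements. The estimator relies on the minimum of the per-element exponential variables being exponentially distributed with rate equal to the weighted cardinality.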

Paper Structure

This paper contains 28 sections, 2 theorems, 36 equations, 10 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Let $0 < \epsilon \ll 1$ be a small positive value. Given a sketch of $m$ registers with minimal value $r_\text{min}$ and maximal value $r_\text{max}$, when $-2^{r_\text{min}+1} \ln \epsilon < C_{\Pi} < -2^{r_\text{max}} \ln(1-\epsilon)$, the register values are not in the discrete set …
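
The interval for $C_{\Pi}$ in the theorem's condition can be evaluated numerically to get a feel for its width. The following Python sketch does only that arithmetic. The function names are hypothetical, and the mapping from a bit width to $r_\text{min}$ and $r_\text{max}$ (a symmetric signed range) is an assumption, not something the source states.

```python
import math


def cardinality_range(r_min, r_max, eps):
    """Return the (lower, upper) bounds on C_Pi from the theorem's condition."""
    lower = -(2.0 ** (r_min + 1)) * math.log(eps)
    upper = -(2.0 ** r_max) * math.log1p(-eps)  # log1p keeps precision for tiny eps
    return lower, upper


def signed_register_bounds(bits):
    """Assumed register range for a given bit width (hypothetical convention)."""
    return -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1


if __name__ == "__main__":
    for bits in (5, 6, 7, 8):
        r_min, r_max = signed_register_bounds(bits)
        lower, upper = cardinality_range(r_min, r_max, eps=1e-6)
        print(f"{bits}-bit registers: {lower:.3e} < C_Pi < {upper:.3e}")
```

Under these assumptions the interval widens rapidly with the register width, since both bounds scale with a power of two of the register limits.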

Figures (10)

  • Figure 1: Basic idea of QSketch
  • Figure 2: Accuracy of all methods under different numbers of registers on real-world datasets.
  • Figure 3: Accuracy of all methods under different numbers of registers on synthetic datasets.
  • Figure 4: Accuracy of all methods under different data sizes on synthetic datasets.
  • Figure 5: Accuracy of our methods QSketch and QSketch-Dyn under different register sizes on synthetic datasets.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2