Count-Min Sketch with Conservative Updates: Worst-Case Analysis

Younes Ben Mazziane, Othmane Marfoq

TL;DR

The paper analyzes Count-Min Sketch with Conservative Updates (CMS-CU) for frequency estimation in data streams under a worst-case scenario where each item appears at most once. It argues that the estimation error is maximal in this regime, which corresponds to CMS-CU with uniform counter selection. In the regime $d=m-1$, it proves that the average estimation error and the average counter rate converge to $1/2$ while the counter gap remains bounded. A Markov-chain framework yields finite-time lower and upper bounds on the average error, parameterized by a gap limit $g$ and computed on state spaces of size $\binom{m+g-d}{g}$. Each bound converges to a limit as $T\to\infty$. In the case $d=m-1$, $g=1$, the limits are $\ell_1(\infty)=(m-1)/(2m-1)$ and $U_1(\infty)=m/(2m-1)$, which coincide at $1/2$ as $m\to\infty$. The bounds can be computed with complexity $\mathcal{O}((T+g)m\binom{m+g-d}{g})$ and are accurate for small $g$ in practical settings (e.g., $m=50$, $d=4$, $g=5$). Overall, the work provides rigorous worst-case performance guarantees for CMS-CU and practical tools to bound estimation error in large-scale data streams.
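The closed-form quantities quoted above are easy to evaluate directly. The following is a minimal sketch that only transcribes the formulas as stated in this summary; the function names `state_space_size` and `limit_bounds_g1` are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def state_space_size(m: int, d: int, g: int) -> int:
    """Number of Markov-chain states, binom(m + g - d, g), as quoted above."""
    return comb(m + g - d, g)


def limit_bounds_g1(m: int) -> tuple[Fraction, Fraction]:
    """Limits quoted for d = m - 1 and g = 1: (m-1)/(2m-1) and m/(2m-1)."""
    lower = Fraction(m - 1, 2 * m - 1)
    upper = Fraction(m, 2 * m - 1)
    return lower, upper


if __name__ == "__main__":
    # State-space size for the numerical example m = 50, d = 4, g = 5.
    print(state_space_size(50, 4, 5))
    # The two limits bracket 1/2, and their difference is 1/(2m - 1).
    for m in (3, 10, 100, 1000):
        lower, upper = limit_bounds_g1(m)
        print(m, float(lower), float(upper), float(upper - lower))
```

Since the difference between the two limits is $1/(2m-1)$, it vanishes as $m\to\infty$, which is consistent with both limits approaching $1/2$.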

Abstract

Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based data structure used to estimate the occurrences of items within a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream. Next, we study CMS-CU in this setting. In the case where $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, contrasting with the vanilla Count-Min Sketch, where the average counter rate is equal to $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error, incorporating a positive integer parameter $g$. Larger values of this parameter improve the accuracy of the bounds. Moreover, the computation of each bound involves examining an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probability matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$, $g=1$, and as $m\to \infty$, we show that the lower and upper bounds coincide. In general, our bounds exhibit high accuracy for small values of $g$, as shown by numerical computation. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
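To make the setting concrete, here is a small simulation sketch of the regime described in the abstract. It rests on assumptions that are not spelled out in this summary: ideal hashing is modeled by drawing $d$ distinct counters uniformly at random for every new item, each item appears exactly once, and the reported rates are simply counter values divided by the number of steps. The function name `simulate` and its return values are hypothetical.

```python
import random


def simulate(m: int, d: int, steps: int, conservative: bool = True, seed: int = 0):
    """Feed `steps` distinct items into a sketch with m counters and d hashes.

    Assumption: each item maps to d distinct counters chosen uniformly at
    random (a stand-in for ideal hash functions).
    Returns (counter_rate, error_rate), both normalized by `steps`.
    """
    rng = random.Random(seed)
    counters = [0] * m
    for _ in range(steps):
        chosen = rng.sample(range(m), d)
        low = min(counters[j] for j in chosen)
        for j in chosen:
            if conservative:
                # Conservative update: only counters at the minimum grow.
                if counters[j] == low:
                    counters[j] = low + 1
            else:
                # Vanilla Count-Min Sketch: every selected counter grows.
                counters[j] += 1
    counter_rate = sum(counters) / (m * steps)
    # A never-seen probe item has true count 0, so its estimate is its error.
    probe = rng.sample(range(m), d)
    error_rate = min(counters[j] for j in probe) / steps
    return counter_rate, error_rate


if __name__ == "__main__":
    m = 5
    print("CMS-CU :", simulate(m, m - 1, 20000, conservative=True))
    print("vanilla:", simulate(m, m - 1, 20000, conservative=False))
```

With $d=m-1$, the conservative variant should report both rates close to $\frac{1}{2}$, while the vanilla variant has counter rate $\frac{m-1}{m}$, in line with the contrast drawn in the abstract. The exact values depend on the seed and on the normalization assumed here.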

Paper Structure

This paper contains 14 sections, 11 theorems, 43 equations, 1 table, and 2 algorithms.

Key Result

Proposition 4.1

Under the ideal-hashing assumption (label assum:IdealHash in the source), for any item $i$ and stream $\bm{r}$, the following inequality holds: $\mathbb{E}\left[e_{i}(t,\bm{r})\right] \leq \mathbb{E}\left[e^{*}(t,\bm{r})\right]$.

Theorems & Definitions (18)

  • Conjecture 4.1
  • Proposition 4.1
  • Theorem 5.1
  • Proof sketch
  • Proposition 5.1
  • Theorem 6.1
  • Proof sketch of Properties eq:stationary and eq:sandwich_expectation in Th. th:main
  • Lemma 6.1
  • Proof sketch of Properties eq:complexity and eq:sandwich_proba in Th. th:main
  • Lemma 6.2
  • ...and 8 more