Count-Min Sketch with Conservative Updates: Worst-Case Analysis

Younes Ben Mazziane; Othmane Marfoq

TL;DR

The paper analyzes Count-Min Sketch with Conservative Updates (CMS-CU) for frequency estimation in data streams under a worst-case scenario in which each item appears at most once. It argues that CMS-CU with uniform counters selection corresponds to this worst-case regime and proves that, in the regime $d=m-1$, the average estimation error and the average counter rate converge to $1/2$ while the counter gap remains bounded. A Markov-chain framework is developed to derive finite-time lower and upper bounds on the average error, parameterized by a gap limit $g$ and with state spaces of size $\binom{m+g-d}{g}$; each bound converges to a limit as $T\to\infty$. In the case $d=m-1$, $g=1$, these limits are $\ell_1(\infty)=(m-1)/(2m-1)$ and $U_1(\infty)=m/(2m-1)$, which both approach $1/2$ and therefore coincide as $m\to\infty$. The bounds can be computed with complexity $\mathcal{O}((T+g)m\binom{m+g-d}{g})$ and are highly accurate for small $g$ in practical settings (e.g., $m=50$, $d=4$, $g=5$). Overall, the work provides rigorous worst-case performance guarantees for CMS-CU and practical tools to bound estimation error in large-scale data streams.
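The closed-form limits and the state-space size quoted above are easy to evaluate numerically. The following is a minimal sketch that only restates those formulas; the function names `limit_bounds` and `state_space_size` are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def limit_bounds(m):
    """Limits of the lower and upper bounds for d = m - 1 and g = 1,
    as quoted in the summary: (m-1)/(2m-1) and m/(2m-1)."""
    lower = Fraction(m - 1, 2 * m - 1)
    upper = Fraction(m, 2 * m - 1)
    return lower, upper


def state_space_size(m, d, g):
    """Number of Markov-chain states quoted in the summary: C(m+g-d, g)."""
    return comb(m + g - d, g)


if __name__ == "__main__":
    # The gap between the two limits is 1/(2m-1), so it shrinks as m grows.
    for m in (2, 10, 100, 1000):
        lower, upper = limit_bounds(m)
        print(m, float(lower), float(upper), float(upper - lower))

    # State-space size for the practical setting mentioned in the summary.
    print(state_space_size(50, 4, 5))
```

Exact rationals are used so that the gap $1/(2m-1)$ between the two limits can be checked without floating-point noise.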

Abstract

Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based data structure used to estimate the occurrences of items within a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream. Next, we study CMS-CU in this setting. In the case where $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, contrasting with the vanilla Count-Min Sketch, where the average counter rate is equal to $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error, incorporating a positive integer parameter $g$. Larger values of this parameter improve the accuracy of the bounds. Moreover, the computation of each bound involves examining an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probabilities matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$, $g=1$, and as $m\to \infty$, we show that the lower and upper bounds coincide. In general, our bounds exhibit high accuracy for small values of $g$, as shown by numerical computation. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
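To make the data structure concrete, here is a minimal sketch of CMS-CU as the abstract describes it: $m$ counters shared by $d$ hash functions, with only the minimal counters of an item incremented on each arrival. The class name `CMSCU`, the salted BLAKE2 hashing, and the unit-increment update are illustrative assumptions, not the authors' implementation.

```python
import hashlib


class CMSCU:
    """Count-Min Sketch with conservative updates over one array of m counters."""

    def __init__(self, m, d):
        if not 1 <= d <= m:
            raise ValueError("need 1 <= d <= m")
        self.m = m
        self.d = d
        self.counters = [0] * m

    def _indices(self, item):
        # Assumption: d salted hashes stand in for d independent hash functions.
        # Two of them may map the same item to the same counter.
        indices = []
        for k in range(self.d):
            digest = hashlib.blake2b(f"{k}:{item}".encode(), digest_size=8).digest()
            indices.append(int.from_bytes(digest, "big") % self.m)
        return indices

    def update(self, item):
        """Conservative update: increment only the counters that hold the minimum."""
        indices = set(self._indices(item))
        low = min(self.counters[j] for j in indices)
        for j in indices:
            if self.counters[j] == low:
                self.counters[j] += 1

    def query(self, item):
        """Estimated count: the minimum over the item's counters."""
        return min(self.counters[j] for j in self._indices(item))


if __name__ == "__main__":
    sketch = CMSCU(m=16, d=3)
    for x in ["a", "b", "a", "c", "a", "b"]:
        sketch.update(x)
    print({x: sketch.query(x) for x in "abc"})
```

The estimate can exceed the true count because of collisions but never falls below it, which is why the analysis focuses on bounding the overestimation error.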

Paper Structure

This paper contains 14 sections, 11 theorems, 43 equations, 1 table, and 2 algorithms.

Introduction
Previous results
Contributions
Notation and Problem Formulation
Related Work
Uniform Counters Selection Assumption
Exact Computation
Lower and Upper Bounds
Conclusion
Useful Lemma
Proof of Proposition \ref{prop:WorstErrorItemNotReq}
Proof of Theorem \ref{th:main-particular}
Proof of Lemma \ref{lem:monotony}
Proof of Properties \ref{eq:complexity} and \ref{eq:sandwich_proba} in Theorem \ref{th:main}

Key Result

Proposition 4.1

Under Assumption \ref{assum:IdealHash}, for any item $i$ and stream $\bm{r}$, the following inequality holds: $\mathbb{E}\left[e_{i}(t,\bm{r})\right] \leq \mathbb{E}\left[e^{*}(t,\bm{r})\right]$.
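The worst-case regime can also be explored by simulation. The sketch below assumes a particular reading of that regime: every arrival is a new item whose $d$ counters are drawn uniformly at random without replacement, the counter rate is the sum of the counters divided by $mT$, and the error seen by a fresh item is the minimum of its $d$ counters divided by $T$. These definitions, and the function name `simulate_uniform_cu`, are assumptions made for illustration; the paper's formal definition of $e^{*}$ may differ.

```python
import random


def simulate_uniform_cu(m, d, steps, seed=0):
    """Simulate conservative updates when each arrival picks d distinct
    counters uniformly at random (assumed model of the worst-case regime).

    Returns (counter_rate, error_ratio) under the assumed definitions:
    counter_rate = sum(counters) / (m * steps),
    error_ratio  = min over a fresh item's d counters / steps.
    """
    rng = random.Random(seed)
    counters = [0] * m
    for _ in range(steps):
        chosen = rng.sample(range(m), d)
        low = min(counters[j] for j in chosen)
        # Conservative update: only the minimal selected counters grow.
        for j in chosen:
            if counters[j] == low:
                counters[j] += 1
    counter_rate = sum(counters) / (m * steps)
    # A fresh item has true count 0, so its estimate before insertion is pure error.
    probe = rng.sample(range(m), d)
    error_ratio = min(counters[j] for j in probe) / steps
    return counter_rate, error_ratio


if __name__ == "__main__":
    m = 10
    rate, err = simulate_uniform_cu(m, m - 1, steps=20000)
    print("CMS-CU, d = m - 1:", rate, err)
    # Vanilla Count-Min Sketch increments all d counters, so its rate is d / m.
    print("vanilla rate for comparison:", (m - 1) / m)
```

For $d=m-1$, both quantities should come out close to $1/2$ in long runs, in line with the almost-sure convergence stated in the abstract, whereas the vanilla sketch has counter rate $(m-1)/m$.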

Theorems & Definitions (18)

Conjecture 4.1
Proposition 4.1
Theorem 5.1
Proof: Sketch proof
Proposition 5.1
Theorem 6.1
Proof: Sketch proof of Properties \ref{eq:stationary} and \ref{eq:sandwich_expectation} in Theorem \ref{th:main}
Lemma 6.1
Proof: Sketch proof of Properties \ref{eq:complexity} and \ref{eq:sandwich_proba} in Theorem \ref{th:main}
Lemma 6.2
...and 8 more
