Count-Min Sketch with Conservative Updates: Worst-Case Analysis
Younes Ben Mazziane, Othmane Marfoq
TL;DR
The paper analyzes Count-Min Sketch with Conservative Updates (CMS-CU) for frequency estimation in data streams under a worst-case scenario in which each item appears at most once. It shows that CMS-CU with uniform counter selection corresponds to this worst-case regime and, in the important regime $d=m-1$, that the average estimation error and the average counter rate converge to $1/2$ while the counter gap remains bounded. A Markov-chain framework is developed to derive tight finite-time lower and upper bounds on the average error. The bounds are parameterized by a gap limit $g$, and each is computed over a state space of size $\binom{m+g-d}{g}$. These bounds converge to the true limit as $T\to\infty$, and in the key case $d=m-1$, $g=1$ they coincide in the large-$m$ limit, with exact values $\ell_1(\infty)=(m-1)/(2m-1)$ and $U_1(\infty)=m/(2m-1)$, both approaching $1/2$. The approach enables efficient computation, with complexity $\mathcal{O}((T+g)m\binom{m+g-d}{g})$, and the bounds are highly accurate for small $g$ in practical settings (e.g., $m=50$, $d=4$, $g=5$). Overall, the work provides rigorous worst-case performance guarantees for CMS-CU and practical tools to bound the estimation error in large-scale data streams.
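To give a feel for the quantities quoted above, the following minimal sketch evaluates the state-space size $\binom{m+g-d}{g}$ and the closed-form limits stated for $d=m-1$, $g=1$. The function names `state_space_size` and `limits_g1` are illustrative choices, not taken from the paper, and the code only evaluates the formulas as quoted; it does not implement the Markov-chain computation itself.

```python
from fractions import Fraction
from math import comb


def state_space_size(m: int, d: int, g: int) -> int:
    """Size of the Markov chain state space, binom(m + g - d, g), as quoted in the summary."""
    return comb(m + g - d, g)


def limits_g1(m: int) -> tuple[Fraction, Fraction]:
    """Closed-form limits quoted for d = m - 1 and g = 1: (lower, upper)."""
    lower = Fraction(m - 1, 2 * m - 1)
    upper = Fraction(m, 2 * m - 1)
    return lower, upper


if __name__ == "__main__":
    # Example setting mentioned in the summary: m = 50, d = 4, g = 5.
    print(state_space_size(50, 4, 5))
    # Both closed forms straddle 1/2, and their gap 1/(2m - 1) shrinks as m grows.
    for m in (5, 50, 500):
        lo, hi = limits_g1(m)
        print(m, float(lo), float(hi), float(hi - lo))
```

For these closed forms the gap between the two values is exactly $1/(2m-1)$, which is why they coincide only in the large-$m$ limit.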
Abstract
Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based data structure used to estimate the number of occurrences of items within a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream. Next, we study CMS-CU in this setting. In the case where $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, in contrast to the vanilla Count-Min Sketch, where the average counter rate is equal to $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error, incorporating a positive integer parameter $g$. Larger values of this parameter improve the accuracy of the bounds. Moreover, the computation of each bound involves examining an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probability matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$, $g=1$, and as $m\to \infty$, we show that the lower and upper bounds coincide. In general, our bounds exhibit high accuracy for small values of $g$, as shown by numerical computation. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
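The update rule described above can be illustrated with a minimal Python sketch, written under stated assumptions rather than as the authors' implementation. The class name `Sketch` and its methods are hypothetical. Instead of real hash functions, each item is mapped to $d$ distinct counters drawn pseudo-randomly from a per-item seed, which is an assumed stand-in for uniform counter selection. With `conservative=True`, only the selected counters that currently hold the minimum are incremented; with `conservative=False`, all $d$ selected counters are incremented, as in the vanilla Count-Min Sketch.

```python
import random
from collections import Counter


class Sketch:
    """Single array of m counters shared by d selections per item (illustrative only)."""

    def __init__(self, m, d, conservative=True, seed=0):
        self.m, self.d = m, d
        self.conservative = conservative
        self.seed = seed
        self.counters = [0] * m

    def _cells(self, item):
        # Assumption: d distinct counters chosen pseudo-randomly per item,
        # standing in for d hash functions.
        rng = random.Random(f"{self.seed}:{item}")
        return rng.sample(range(self.m), self.d)

    def update(self, item):
        cells = self._cells(item)
        if self.conservative:
            # Conservative update: touch only the counters at the current minimum.
            lowest = min(self.counters[i] for i in cells)
            cells = [i for i in cells if self.counters[i] == lowest]
        for i in cells:
            self.counters[i] += 1

    def query(self, item):
        # Estimated count: the minimum over the item's d counters.
        return min(self.counters[i] for i in self._cells(item))


def mean_overestimate(sketch, stream):
    """Average of (estimate - true count) over the distinct items of the stream."""
    truth = Counter(stream)
    return sum(sketch.query(x) - c for x, c in truth.items()) / len(truth)


if __name__ == "__main__":
    stream = list(range(2000))  # each item appears exactly once
    cu = Sketch(50, 4, conservative=True, seed=1)
    cms = Sketch(50, 4, conservative=False, seed=1)
    for x in stream:
        cu.update(x)
        cms.update(x)
    print(mean_overestimate(cu, stream), mean_overestimate(cms, stream))
```

Because the conservative variant increments a subset of the counters that the vanilla variant increments, its counters are never larger, and neither variant underestimates a true count.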
