Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon
Seth Pettie, Dingyu Wang
TL;DR
This work introduces the Fisher-Shannon (FiSh) number, the ratio of a sketch's limiting Shannon entropy to its normalized Fisher information, as a rigorous information-theoretic measure of the space–error tradeoff in cardinality estimation sketches. It shows that every base-$q$ variant of PCSA has FiSh-number $H_0/I_0\approx1.98016$, while every base-$q$ variant of HyperLogLog is strictly less efficient for finite $q$ but approaches the same limit as $q\rightarrow\infty$. The authors propose Fishmonger, a smoothed, entropy-compressed variant of PCSA that uses $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits to achieve standard error $1/\sqrt{m}$, and prove that no linearizable sketch (a natural subclass of mergeable sketches that includes the popular ones) can have a FiSh-number below $H_0/I_0$, matching the PCSA benchmark. Together with a structural analysis of sketch classes and the notion of random-offset smoothing, these results pin down the constants for the best known mergeable designs under the random oracle model and offer practical guidance for memory-efficient cardinality estimation. The work thus combines Fisher information and Shannon entropy to argue for precise optimality of classic and new sketches, rather than optimality up to large constant factors, with Fishmonger offering a concrete path toward implementable, near-optimal systems.
Abstract
Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error. We define a new measure of efficiency for data sketches called the Fisher-Shannon (FiSh) number $\mathcal{H}/\mathcal{I}$. It captures the tension between the limiting Shannon entropy ($\mathcal{H}$) of the sketch and its normalized Fisher information ($\mathcal{I}$), which characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our aim in introducing the FiSh-number is to build the mathematical machinery necessary to argue for precise optimality, rather than asymptotic optimality up to large constant factors. Our results are as follows. [1] We prove that all base-$q$ variants of Flajolet and Martin's PCSA sketch have FiSh-number $H_0/I_0 \approx 1.98016$ and that every base-$q$ variant of HyperLogLog has FiSh-number worse than $H_0/I_0$, but that they tend to $H_0/I_0$ in the limit as $q\rightarrow \infty$. Here $H_0,I_0$ are precisely defined constants. [2] We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. Fishmonger processes a multiset of elements from $[U]$ such that at all times, with high probability, its space is $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits and its standard error is $1/\sqrt{m}$. For example, to achieve a 1% standard error ($m = 10^4$), one needs a little more than 19,800 bits, or $\approx 2.42$ kilobytes (counting 1,024 bytes per kilobyte). [3] Finally, we give circumstantial evidence that $H_0/I_0$ is the optimum FiSh-number of mergeable sketches for Cardinality Estimation. We define a natural subset of mergeable sketches called linearizable sketches and prove that no member of this class can beat $H_0/I_0$. The popular mergeable sketches are, in fact, also linearizable.
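The space figure quoted in result [2] can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the paper: it takes the rounded constant $H_0/I_0 \approx 1.98016$ from the abstract, ignores the $(1+o(1))$ factor, and uses hypothetical helper names (`buckets_for_error`, `space_bits`) chosen only for illustration.

```python
import math

# H_0/I_0 as quoted (rounded) in the abstract; the exact constants are
# defined in the paper and are not recomputed here.
FISH_NUMBER = 1.98016


def buckets_for_error(std_err: float) -> int:
    """Smallest m with 1/sqrt(m) <= std_err (hypothetical helper)."""
    # The small tolerance guards against floating-point rounding pushing
    # an exact value such as 10000 up to 10001.
    return math.ceil(1.0 / (std_err * std_err) - 1e-9)


def space_bits(m: int, fish_number: float = FISH_NUMBER) -> float:
    """Leading-order space in bits, ignoring the (1 + o(1)) factor."""
    return fish_number * m


m = buckets_for_error(0.01)   # 1% standard error
bits = space_bits(m)
size_bytes = bits / 8
size_kib = size_bytes / 1024  # kilobytes counted as 1,024 bytes

print(f"m = {m}, bits = {bits:.1f}, bytes = {size_bytes:.1f}, KiB = {size_kib:.2f}")
```

Under these assumptions a 1% standard error gives $m = 10^4$, about 19,802 bits, which is roughly 2,475 bytes or 2.42 kilobytes of 1,024 bytes each, in line with the figure in the abstract.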
