Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

Seth Pettie, Dingyu Wang

TL;DR

This work introduces the Fisher-Shannon (FiSh) number as an information-theoretic measure of the space–error tradeoff in cardinality estimation sketches. It shows that base-$q$ PCSA has FiSh-number $H_0/I_0\approx1.98016$ for every $q$, while base-$q$ LogLog (LL) is strictly less efficient for every finite $q$ but approaches the same limit as $q\rightarrow\infty$. The authors propose Fishmonger, a smoothed, entropy-compressed PCSA-based sketch that uses $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits with standard error $1/\sqrt{m}$, and prove that no linearizable sketch (a natural subclass of mergeable sketches) has a FiSh-number below $H_0/I_0$, matching the PCSA benchmark. Together with a structural classification of sketches and the notion of random-offset smoothing, these results give precise constants, rather than bounds up to constant factors, for mergeable designs in the random oracle model. Fishmonger shows that the $H_0/I_0$ benchmark can be met by a concrete sketch.

Abstract

Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error. We define a new measure of efficiency for data sketches called the Fisher-Shannon (FiSh) number $\mathcal{H}/\mathcal{I}$. It captures the tension between the limiting Shannon entropy ($\mathcal{H}$) of the sketch and its normalized Fisher information ($\mathcal{I}$) that characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our aim in introducing the FiSh-number is to build the mathematical machinery necessary to argue for precise optimality, rather than asymptotic optimality, up to large constant factors. Our results are as follows. [1] We prove that all base-$q$ variants of Flajolet and Martin's PCSA sketch have FiSh-number $H_0/I_0 \approx 1.98016$ and that every base-$q$ variant of HyperLogLog has FiSh-number worse than $H_0/I_0$, but that they tend to $H_0/I_0$ in the limit as $q\rightarrow \infty$. Here $H_0,I_0$ are precisely defined constants. [2] We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. Fishmonger processes a multiset of $[U]$ such that at all times, w.h.p., its space is $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits and its standard error is $1/\sqrt{m}$. For example, to achieve a 1% standard error, one needs a little more than 19,800 bits, or $\approx 2.42$ kilobytes. [3] Finally, we give circumstantial evidence that $H_0/I_0$ is the optimum FiSh-number of mergeable sketches for Cardinality Estimation. We define a natural subset of mergeable sketches called linearizable sketches and prove that no member of this class can beat $H_0/I_0$. The popular mergeable sketches are, in fact, also linearizable.
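The 1% example in the abstract follows from the two stated formulas: standard error $1/\sqrt{m}$ and space of about $(H_0/I_0)m$ bits. The arithmetic can be sketched in Python as below. The function name is hypothetical, the $(1+o(1))$ factor is ignored, and a kilobyte is assumed to mean 1024 bytes.

```python
# Constant quoted in the abstract: H_0/I_0 ≈ 1.98016.
FISH_NUMBER = 1.98016


def fishmonger_space_bits(target_std_error: float) -> float:
    """Leading-order space in bits for a target standard error.

    Assumes standard error = 1/sqrt(m) and space = (H_0/I_0) * m bits,
    dropping the (1 + o(1)) factor from the abstract.
    """
    m = 1.0 / target_std_error ** 2  # invert 1/sqrt(m) = target error
    return FISH_NUMBER * m


bits = fishmonger_space_bits(0.01)  # 1% standard error
kib = bits / 8 / 1024               # assumption: 1 kilobyte = 1024 bytes
print(f"{bits:.1f} bits, {kib:.2f} KiB")
```

For a 1% standard error this gives $m = 10{,}000$, hence roughly 19,800 bits or about 2.42 kilobytes, in line with the figures quoted in the abstract.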

Paper Structure

This paper contains 43 sections, 32 theorems, 152 equations, 7 figures, 2 tables.

Key Result

Theorem 1

Let $(X_0,X_1,\ldots,X_{m-1})$ be a tuple of random variables. Then $H(X_0,X_1,\ldots,X_{m-1})=\sum_{i=0}^{m-1} H(X_i\mid X_{0},\ldots,X_{i-1})$.
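As a numerical sanity check of the chain rule, the following Python sketch compares both sides on a small joint distribution. The distribution (three binary variables with arbitrary weights) and all helper names are assumptions made for illustration; they do not come from the paper.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, k):
    """Marginal distribution of the first k coordinates (X_0..X_{k-1})."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[:k]] += p
    return dict(out)


def conditional_entropy(joint, i):
    """H(X_i | X_0..X_{i-1}) = -sum p(x_0..x_i) * log2 p(x_i | x_0..x_{i-1})."""
    prefix = marginal(joint, i)
    upto = marginal(joint, i + 1)
    total = 0.0
    for outcome, p in upto.items():
        if p > 0:
            total -= p * math.log2(p / prefix[outcome[:i]])
    return total


# Hypothetical joint distribution over (X_0, X_1, X_2), each in {0, 1}.
weights = [3, 1, 4, 1, 5, 9, 2, 6]
outcomes = list(itertools.product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

lhs = entropy(joint)                                         # H(X_0, X_1, X_2)
rhs = sum(conditional_entropy(joint, i) for i in range(3))   # sum of conditionals
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding. Each conditional entropy is computed directly from conditional probabilities rather than as a difference of joint entropies, so the agreement is not built in by construction.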

Figures (7)

  • Figure 1: (a) The PCSA-partition of the Dartboard into $m=16$ columns, with 32 darts. (b) The state of the PCSA sketch; occupied cells are blue. (c) The state of the (Hyper)LogLog sketch, which uses the same partition. Every cell hit by a dart or below one hit by a dart is occupied.
  • Figure 2: (a) The Dartboard is partitioned into individual hash values. (b) A Bottom-4 sketch keeps the smallest 4 hash values, hence all occupied (blue) cells have no effect on the sketch.
  • Figure 3: (a) The S-Bitmap Dartboard partition, with $\lambda=17$ darts. (b) The state of the S-Bitmap if the darts were processed in left-to-right order. (c) The state of the S-Bitmap if the darts were processed in top-to-bottom order. The S-Bitmap's transition function depends on processing order, so it is not both commutative and idempotent, and the sketch is therefore not mergeable.
  • Figure 4: A classification of sketching algorithms for cardinality estimation.
  • Figure 5: Entropy and normalized Fisher information number for $q$-LogLog sketches for $\lambda\in[2^{16},2^{24}]$. See the section defining $q\text{-}\textsf{LL}$ for the precise definitions. Left: At a sufficiently small scale, the oscillations in entropy (top) and normalized information (bottom) of $2\text{-}\textsf{LL}$ become visible. Right: At higher values of $q \in \{2,4,16\}$, the oscillations in entropy (top) and normalized information (bottom) of $q\text{-}\textsf{LL}$ are clearly visible.
  • ...and 2 more figures
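The relationship between the PCSA and (Hyper)LogLog states described in Figure 1 can be illustrated with a simplified base-2 simulation. This is a sketch under stated assumptions, not the paper's construction: `random.Random` stands in for the random oracle, row $j$ is hit with probability $2^{-(j+1)}$, and `ROWS`, `throw_dart`, and the other names are hypothetical.

```python
import random

M, ROWS = 16, 32  # 16 columns as in Figure 1; ROWS is an assumed cap on depth


def throw_dart(rng):
    """Pick a uniform column and a geometric row (row j w.p. about 2^-(j+1))."""
    col = rng.randrange(M)
    row = 0
    while row < ROWS - 1 and rng.random() < 0.5:
        row += 1
    return col, row


def pcsa_state(darts):
    """PCSA remembers exactly which cells were hit."""
    cells = [[False] * ROWS for _ in range(M)]
    for col, row in darts:
        cells[col][row] = True
    return cells


def loglog_state(darts):
    """LogLog remembers only the deepest hit row per column (-1 if none)."""
    top = [-1] * M
    for col, row in darts:
        top[col] = max(top[col], row)
    return top


def merge_pcsa(a, b):
    """Merge two PCSA sketches with a cell-wise OR."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_loglog(a, b):
    """Merge two LogLog sketches with a column-wise max."""
    return [max(x, y) for x, y in zip(a, b)]


rng = random.Random(0)
darts = [throw_dart(rng) for _ in range(32)]   # 32 darts as in Figure 1
left, right = darts[:20], darts[12:]           # overlapping sub-streams

merged_pcsa = merge_pcsa(pcsa_state(left), pcsa_state(right))
merged_ll = merge_loglog(loglog_state(left), loglog_state(right))
print(merged_pcsa == pcsa_state(darts), merged_ll == loglog_state(darts))
```

Because OR and max are commutative and idempotent, merging the sketches of two overlapping sub-streams reproduces the sketch of their union. This is the mergeability property that the S-Bitmap in Figure 3 lacks.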

Theorems & Definitions (72)

  • Remark 1
  • Theorem 1: Chain rule for entropy [CoverT06]
  • Theorem 2: Chain rule for Fisher information [zegers2015fisher]
  • Definition 1: Induced Distribution Family
  • Definition 2: Weak Scale-Invariance
  • Remark 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 62 more