ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale

Otmar Ertl

TL;DR

ExaLogLog (ELL) is a generalized, space-efficient sketch for approximate distinct counting that preserves mergeability, idempotence, reproducibility, and reducibility while keeping inserts constant-time. By adopting a novel update-value distribution and maximum-likelihood (ML) estimation via Newton’s method (with a martingale estimator as an option), it achieves a practical memory-variance product (MVP) as low as $3.67$ and a reported 43% space reduction over HyperLogLog at the same accuracy, while supporting distinct counts up to the exa-scale. The work also provides a sparse-mode representation based on hash tokens that allows deferred allocation, along with a reference Java implementation and extensive experimental validation. Overall, ELL is a versatile, scalable option for distributed data stores and analytics that balances theoretical guarantees with real-world performance.

Abstract

This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
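
The properties listed in the abstract can be made concrete with a toy sketch. The following is a minimal illustration only: it uses simplified HyperLogLog-style registers rather than the ELL register layout, and the names `P`, `hash64`, `insert`, `merge`, and `build` as well as the choice of SHA-256 as the hash are assumptions made for this example, not part of the paper.

```python
import hashlib

P = 4  # hypothetical precision: the toy sketch has 2**P registers


def hash64(item: str) -> int:
    """64-bit hash of an item (SHA-256 truncated; an arbitrary choice for this toy)."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")


def new_sketch() -> list:
    return [0] * (1 << P)


def insert(registers: list, item: str) -> None:
    """Constant-time insert: one hash, one register update."""
    h = hash64(item)
    idx = h >> (64 - P)                      # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros of `rest`, plus one
    registers[idx] = max(registers[idx], rank)


def merge(a: list, b: list) -> list:
    """Register-wise maximum; yields the sketch of the union of both inputs."""
    return [max(x, y) for x, y in zip(a, b)]


def build(items) -> list:
    s = new_sketch()
    for it in items:
        insert(s, it)
    return s


left = build(["a", "b", "c"])
right = build(["c", "d"])

assert build(["a", "a", "b", "c"]) == left                # idempotent
assert build(["c", "b", "a"]) == left                     # commutative
assert merge(left, right) == build(["a", "b", "c", "d"])  # mergeable
```

Because every update is a register-wise maximum, duplicates and insertion order cannot change the final state, which is what makes such sketches safe to build in parallel and combine afterwards.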

Paper Structure

This paper contains 25 sections, 3 theorems, 44 equations, 19 figures, 2 tables, 8 algorithms.

Key Result

lemma 1

For $\rho_{\text{update}}$ and $\phi$ as defined in (equ:update_density) and (equ:exponent_func), the identity $\sum_{k = u+1}^{(65 - p - t)2^t}\rho_{\text{update}}(k) = \frac{2^{t}(1-t +\phi(u)) - u }{2^{\phi(u)}}$ holds.
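
The identity can be checked numerically with exact rational arithmetic. The definitions (equ:update_density) and (equ:exponent_func) are not reproduced on this page, so the sketch below assumes the forms $\phi(k) = \min(t + 1 + \lfloor (k-1)/2^t \rfloor,\, 64 - p)$ and $\rho_{\text{update}}(k) = 2^{-\phi(k)}$. These forms are an assumption of this example and should be compared against the paper before relying on them.

```python
from fractions import Fraction


def phi(k: int, p: int, t: int) -> int:
    # Assumed form of (equ:exponent_func); `//` is floor division,
    # so phi(0) evaluates to t.
    return min(t + 1 + (k - 1) // (2 ** t), 64 - p)


def rho_update(k: int, p: int, t: int) -> Fraction:
    # Assumed form of (equ:update_density).
    return Fraction(1, 2 ** phi(k, p, t))


def tail_sum(u: int, p: int, t: int) -> Fraction:
    """Left-hand side: sum of rho_update(k) for k = u+1 .. (65-p-t)*2^t."""
    k_max = (65 - p - t) * 2 ** t
    return sum((rho_update(k, p, t) for k in range(u + 1, k_max + 1)), Fraction(0))


def closed_form(u: int, p: int, t: int) -> Fraction:
    """Right-hand side of Lemma 1."""
    f = phi(u, p, t)
    return Fraction(2 ** t * (1 - t + f) - u, 2 ** f)


def lemma_holds(p: int, t: int) -> bool:
    k_max = (65 - p - t) * 2 ** t
    return all(tail_sum(u, p, t) == closed_form(u, p, t) for u in range(k_max + 1))


assert lemma_holds(p=2, t=2)
assert tail_sum(0, p=2, t=2) == 1  # the update values form a probability distribution
```

Setting $u = 0$ gives a total of 1 under these assumed definitions, so the lemma also confirms that $\rho_{\text{update}}$ is normalized.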

Figures (19)

  • Figure 1: Memory versus relative standard error for different MVP values, following (equ:mvp_def).
  • Figure 2: Comparison of the PMFs (equ:geometric) and (equ:step_dist) for $b = 2^{2^{-t}}$.
  • Figure 3: Two element insertions into an ELL sketch with parameters $p=2$, $t=2$, $d=6$, which has $2^p=4$ registers, each $6+t+d=14$ bits in size.
  • Figure 4: The MVP according to (equ:mvp_uncompressed) with $b = 2^{2^{-t}}$ and $q= 6 + t$ when storing the registers in a bit array and using an efficient unbiased estimator. Arrows indicate minima.
  • Figure 5: The MVP according to (equ:mvp_martingale) with $b = 2^{2^{-t}}$ and $q= 6 + t$ when storing the registers in a bit array and using the martingale estimator. Arrows indicate minima.
  • ...and 14 more figures
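
The register arithmetic in the Figure 3 caption, and the memory-versus-error relationship plotted in Figure 1, can be sketched in a few lines. The function names are hypothetical, and the second function assumes that the MVP is defined as memory in bits multiplied by the relative variance of the estimate; (equ:mvp_def) is not reproduced on this page, so treat that definition as an assumption.

```python
def register_bits(t: int, d: int) -> int:
    """Bits per register, using the 6 + t + d sizing from the Figure 3 caption."""
    return 6 + t + d


def sketch_bits(p: int, t: int, d: int) -> int:
    """Total size of a dense register array with 2**p registers."""
    return (2 ** p) * register_bits(t, d)


def bits_for_error(mvp: float, rel_std_error: float) -> float:
    """Memory needed for a target relative standard error.

    Assumes MVP = memory_bits * rel_std_error**2.
    """
    return mvp / rel_std_error ** 2


# Parameters from the Figure 3 caption: p = 2, t = 2, d = 6.
assert register_bits(t=2, d=6) == 14
assert sketch_bits(p=2, t=2, d=6) == 56

# With the MVP of 3.67 quoted in the TL;DR, a 1% relative standard error
# needs roughly 36,700 bits under the assumed definition.
print(round(bits_for_error(3.67, 0.01)))
```

Under this definition, halving the target error quadruples the required memory, and at a fixed error the memory of two sketch types scales with the ratio of their MVPs.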

Theorems & Definitions (6)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • proof