ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale

Otmar Ertl

TL;DR

ExaLogLog (ELL) is a generalized, space-efficient sketch for approximate distinct counting that preserves mergeability, idempotence, reproducibility, and reducibility while keeping inserts constant-time. By adopting a novel update-value distribution and maximum-likelihood (ML) estimation via Newton’s method (with a martingale estimator as an option), it achieves a practical memory-variance product (MVP) as low as $3.67$ and a reported 43% space reduction over HyperLogLog at the same accuracy, while supporting distinct counts up to the exa-scale. The work also provides a sparse-mode representation based on hash tokens that allows deferred allocation of the register array, along with a reference Java implementation and extensive experimental validation. Overall, ELL targets distributed data stores and analytics systems that need both theoretical guarantees and real-world performance.
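To make the MVP figure concrete, the following minimal sketch converts an MVP into a memory budget for a target relative standard error, using MVP = (memory in bits) × (relative variance of the estimate). The ELL value $3.67$ is taken from the summary above. The HyperLogLog value is an assumption, not a number from this page: it uses 6-bit registers and the commonly quoted error of about $1.04/\sqrt{m}$ for $m$ registers. The function name `memory_bits` and the 2% target are purely illustrative.

```python
def memory_bits(mvp: float, rel_std_error: float) -> float:
    """Bits needed to reach a target relative standard error.

    Follows from MVP = bits * relative_variance, so bits = MVP / error^2.
    """
    return mvp / rel_std_error ** 2


ELL_MVP = 3.67           # MVP quoted in the summary above
HLL_MVP = 6 * 1.04 ** 2  # assumption: 6-bit registers, error ~ 1.04 / sqrt(m)

target = 0.02  # illustrative target: 2 % relative standard error
ell_bits = memory_bits(ELL_MVP, target)
hll_bits = memory_bits(HLL_MVP, target)
saving = 1 - ell_bits / hll_bits  # independent of the chosen target error

print(f"ELL: {ell_bits / 8:.0f} bytes")
print(f"HLL: {hll_bits / 8:.0f} bytes")
print(f"relative saving: {saving:.1%}")
```

Because the target error cancels in the ratio, the relative saving depends only on the two MVPs. Under the assumed HyperLogLog constant it comes out close to the reported 43%.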

Abstract

This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.

Paper Structure

This paper contains 25 sections, 3 theorems, 44 equations, 19 figures, 2 tables, and 8 algorithms.

Introduction
Related Work
Summary of Contributions
Data Structure
Previous Theoretical Results
Approximated Update Value Distribution
ELL
Choice of Parameters
Relationship to Other Data Structures
Statistical Inference
Probability Mass Function for Registers
Maximum-Likelihood Estimation
Martingale Estimation
Practical Implementation
Mergeability
...and 10 more sections

Key Result

lemma 1

For $\rho_{\text{update}}$ and $\phi$ as defined in (equ:update_density) and (equ:exponent_func), the identity $\sum_{k = u+1}^{(65 - p - t)2^t}\rho_{\text{update}}(k) = \frac{2^{t}(1-t +\phi(u)) - u}{2^{\phi(u)}}$ holds.
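The lemma can be checked numerically with exact rational arithmetic. The definitions it refers to are not reproduced on this page, so the sketch below assumes $\phi(k) = \min(t + 1 + \lfloor (k-1)/2^t \rfloor,\ 64 - p)$ and $\rho_{\text{update}}(k) = 2^{-\phi(k)}$. These forms are consistent with the lemma at its boundaries (the full sum equals 1 at $u = 0$ and 0 at the upper limit), but they should be treated as assumptions. The helper name `check_lemma` is illustrative.

```python
from fractions import Fraction


def phi(k: int, p: int, t: int) -> int:
    # Assumed exponent function: min(t + 1 + floor((k - 1) / 2^t), 64 - p).
    # Python's // is floor division, so phi(0) evaluates to t.
    return min(t + 1 + (k - 1) // 2**t, 64 - p)


def rho_update(k: int, p: int, t: int) -> Fraction:
    # Assumed update-value probability: 2^(-phi(k)).
    return Fraction(1, 2 ** phi(k, p, t))


def check_lemma(p: int, t: int) -> bool:
    """Compare both sides of the lemma for every u in 0..k_max, exactly."""
    k_max = (65 - p - t) * 2**t
    tail = Fraction(0)  # running value of sum_{k=u+1}^{k_max} rho_update(k)
    for u in range(k_max, -1, -1):
        rhs = Fraction(2**t * (1 - t + phi(u, p, t)) - u, 2 ** phi(u, p, t))
        if tail != rhs:
            return False
        if u > 0:
            tail += rho_update(u, p, t)  # extend the tail sum down to k = u
    return True


for p, t in [(2, 2), (8, 1), (4, 0)]:
    print(p, t, check_lemma(p, t))
```

Using `Fraction` avoids floating-point rounding, which matters here because the smallest probabilities are on the order of $2^{-(64-p)}$.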

Figures (19)

Figure 1: The memory over the relative standard error for different MVPs following (equ:mvp_def).
Figure 2: Comparing the PMFs (equ:geometric) and (equ:step_dist) for $b = 2^{2^{-t}}$.
Figure 3: Two element insertions into an ELL sketch with parameters $p=2$, $t=2$, $d=6$, which has $2^p=4$ registers with a size of $6+t+d=14$ bits.
Figure 4: The MVP according to (equ:mvp_uncompressed) with $b = 2^{2^{-t}}$ and $q = 6 + t$ when storing the registers in a bit array and using an efficient unbiased estimator. Arrows indicate minima.
Figure 5: The MVP according to (equ:mvp_martingale) with $b = 2^{2^{-t}}$ and $q = 6 + t$ when storing the registers in a bit array and using the martingale estimator. Arrows indicate minima.
...and 14 more figures

Theorems & Definitions (6)

lemma 1
proof
lemma 2
proof
lemma 3
proof
