UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting

Otmar Ertl

TL;DR

UltraLogLog (ULL) generalizes HyperLogLog (HLL), EHLL, and PCSA into a single compact data structure with 8-bit registers that preserves idempotent, commutative inserts and mergeability. The author derives analytic expressions for the Fisher information and the Shannon entropy to guide parameter choice, arriving at a memory-variance product (MVP) of about $4.63$, a 28% improvement over HLL, for a practical configuration with base $b=2$ and $8$-bit registers that include 2 extra bits. The paper presents a fast GRA-based estimator (FGRA) and a high-precision maximum-likelihood estimator (MLE) that approach the Cramér–Rao bound, plus small- and large-count corrections and a martingale estimator for non-distributed settings; FGRA reaches about 94.6% efficiency in this configuration. Comprehensive experiments validate the theoretical MVP gains, demonstrate improved compressibility, and show competitive speed. An open-source Java implementation in Hash4j enables straightforward deployment and migration from HLL. Overall, ULL offers significant space savings while maintaining practical accuracy and compatibility, suggesting it could become a new standard for approximate distinct counting.
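To make the MVP figure concrete, the following sketch converts it into a relative standard error. It relies on the usual definition of the MVP as memory in bits times relative variance. The precision `p = 12` is a hypothetical choice, and the HLL value is only back-computed from the rounded 28% figure quoted above, so treat both as illustrative rather than as numbers from the paper.

```python
import math

def relative_standard_error(mvp: float, num_registers: int, bits_per_register: int) -> float:
    """Relative standard error implied by MVP = (memory in bits) * (relative variance)."""
    memory_bits = num_registers * bits_per_register
    return math.sqrt(mvp / memory_bits)

ULL_MVP = 4.63                           # value quoted in the TL;DR
HLL_MVP_IMPLIED = ULL_MVP / (1 - 0.28)   # assumption: back-computed from the rounded 28%

p = 12                                   # hypothetical precision
m = 2 ** p                               # number of registers
ull_bits = m * 8                         # 8-bit registers
ull_rse = relative_standard_error(ULL_MVP, m, 8)

# Memory a sketch with the implied HLL MVP would need for the same error.
hll_bits_same_error = HLL_MVP_IMPLIED / ull_rse ** 2

print(f"ULL: {ull_bits} bits, relative standard error {ull_rse:.4%}")
print(f"HLL (implied) bits for the same error: {hll_bits_same_error:.0f}")
```

Because the error scales with the square root of MVP divided by memory, a 28% smaller MVP translates directly into 28% less memory at equal accuracy.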

Abstract

Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Due to its space efficiency and suitability for distributed systems, it is widely used and also implemented in numerous databases. This work presents UltraLogLog, which shares the same practical properties as HyperLogLog. It is commutative, idempotent, mergeable, and has a fast guaranteed constant-time insert operation. At the same time, it requires 28% less space to encode the same amount of distinct count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed, which still achieves a space reduction of 24%, but at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog is able to reduce space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction when using standard compression algorithms. All this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
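The three properties named in the abstract (commutative, idempotent, mergeable) can be illustrated with a deliberately simplified toy. The class below is a HyperLogLog-style register array in which each register keeps only the maximum update value; it is not UltraLogLog's register layout (ULL's 8-bit registers carry extra bits that this toy does not model), it contains no estimator, and the name `ToyMaxSketch` and all parameters are hypothetical.

```python
import hashlib

class ToyMaxSketch:
    """HLL-style toy: every register stores the largest update value it has seen."""

    def __init__(self, p: int = 4):
        self.p = p
        self.registers = [0] * (1 << p)

    def add(self, item: str) -> None:
        # Derive a 64-bit hash from the item (SHA-256 is used only for determinism).
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)               # first p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Update value: leading zeros of the remaining bits, plus one.
        value = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], value)

    def merge(self, other: "ToyMaxSketch") -> "ToyMaxSketch":
        merged = ToyMaxSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

def build(items):
    sketch = ToyMaxSketch()
    for item in items:
        sketch.add(item)
    return sketch

a_items = [f"user-{i}" for i in range(100)]
b_items = [f"user-{i}" for i in range(50, 150)]

# Inserting the same items twice leaves the state unchanged.
idempotent = build(a_items + a_items).registers == build(a_items).registers
# Insertion order does not matter.
commutative = build(list(reversed(a_items))).registers == build(a_items).registers
# Merging two partial sketches equals sketching the combined stream.
mergeable = build(a_items).merge(build(b_items)).registers == build(a_items + b_items).registers

print(idempotent, commutative, mergeable)
```

All three checks hold here because the register update is a maximum, which is itself idempotent, commutative, and associative.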

Paper Structure

This paper contains 26 sections, 18 theorems, 129 equations, 18 figures, 1 table, 7 algorithms.

Key Result

Lemma 1

If $z_u := e^{-\frac{n(b-1)}{m b^{u}}}$ with $n, m > 0$ and $b > 1$, where $n$ is the distinct count, $m$ the number of registers, $b$ the base, and $u$ the maximum update value, the following identities hold:
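As a small numerical sketch of the definition only, the code below evaluates the quantity for a hypothetical configuration, writing $n$ for the distinct count, $m$ for the number of registers, $b$ for the base, and $u$ for the maximum update value (letters chosen here for readability). The relation checked at the end, $z_u = z_{u+1}^{b}$, follows directly from the definition and is not taken from the lemma.

```python
import math

def z(u: int, n: float, m: float, b: float) -> float:
    """Evaluate z_u = exp(-n (b - 1) / (m b^u)) for n, m > 0 and b > 1."""
    if not (n > 0 and m > 0 and b > 1):
        raise ValueError("requires n > 0, m > 0 and b > 1")
    return math.exp(-n * (b - 1) / (m * b ** u))

# Hypothetical parameters: one million distinct elements, 4096 registers, base 2.
n, m, b = 1_000_000.0, 4096.0, 2.0
values = [z(u, n, m, b) for u in range(30)]

# z_u lies in (0, 1) and grows toward 1 as u increases.
increasing = all(x < y for x, y in zip(values, values[1:]))

# Directly from the definition: raising z_{u+1} to the power b gives z_u.
power_relation = all(
    math.isclose(values[u], values[u + 1] ** b, rel_tol=1e-9)
    for u in range(len(values) - 1)
)

print(increasing, power_relation)
```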

Figures (18)

  • Figure 1: The theoretical asymptotic MVP over the base $b$ for $q=6$ and $q=7$ and various values of $d$ when assuming a memory footprint of $m(q+d)$ bits ($q$: bits for the maximum update value, $d$: number of extra bits, $m$: number of registers). The top chart shows the 28% improvement of ULL over HLL.
  • Figure 2: The theoretical asymptotic MVP over the base $b$ for various values of $d$ under the assumption of optimal lossless compression. The MVP of ULL is 24% smaller than that of HLL.
  • Figure 3: a) The asymptotic GRA estimator efficiency over the base $b$ for $q=6$ and various values of $d$. b) The efficiencies of the GRA estimator and the proposed FGRA estimator as a function of the GRA parameter $\tau$ for $b=2$ and $d=2$. Crosses indicate optimal choices of $\tau$.
  • Figure 4: The MVP as a function of the base $b$ for $q=6$ and various values of $d$ when using martingale estimation. The MVP of ULL is 17% smaller than that of HLL.
  • Figure 5: The relative bias and the RMSE for the FGRA, ML, and martingale estimators for precisions $p\in\lbrace 8, 12, 16\rbrace$ obtained from 100 000 simulation runs. The theoretically predicted errors perfectly match the experimental results. Individual insertions were simulated up to a distinct count of $10^{6}$ before switching to the fast simulation strategy.
  • ...and 13 more figures

Theorems & Definitions (36)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 26 more