Huffman-Bucket Sketch: A Simple $O(m)$ Algorithm for Cardinality Estimation

Matti Karppa

TL;DR

The Huffman-Bucket Sketch (HBS) is a simple, mergeable data structure that losslessly compresses a HyperLogLog sketch with $m$ registers to the optimal $O(m+\log n)$ bits. The Huffman tree is proved to need rebuilding only $O(\log n)$ times over a stream, roughly each time the cardinality doubles.

Abstract

We introduce the Huffman-Bucket Sketch (HBS), a simple, mergeable data structure that losslessly compresses a HyperLogLog (HLL) sketch with $m$ registers to optimal space $O(m+\log n)$ bits, with amortized constant-time updates, acting as a drop-in replacement for HLL that retains mergeability and substantially reduces memory requirements. We partition registers into small buckets and encode their values with a global Huffman codebook derived from the strongly concentrated HLL rank distribution, using the current cardinality estimate to determine the mode of the distribution. We prove that the Huffman tree needs rebuilding only $O(\log n)$ times over a stream, roughly when cardinality doubles. The framework can be extended to other sketches with similar strongly concentrated distributions. We provide preliminary numerical evidence suggesting that HBS is practical and potentially competitive with the state of the art.
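The codebook idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the common Poissonized model of a single HLL register, $P(R \le r) = e^{-\lambda 2^{-r}}$ with load factor $\lambda = n/m$, it caps ranks at a hypothetical `max_rank`, and it treats a change in $r^*=\lceil\log_2\lambda\rceil$ as a stand-in for the rebuild trigger. All function names are made up for illustration, and bucketing is left out entirely.

```python
import heapq
import math


def register_pmf(lam, max_rank=32):
    """Poissonized pmf of one HLL register at load factor lam = n/m.

    Assumption: P(R <= r) = exp(-lam * 2**-r). Ranks of max_rank and above
    are lumped into the last entry, so the list has max_rank + 1 entries.
    """
    def cdf(r):
        return math.exp(-lam * 2.0 ** (-r))

    pmf = [cdf(0)]
    pmf += [cdf(r) - cdf(r - 1) for r in range(1, max_rank)]
    pmf.append(1.0 - cdf(max_rank - 1))
    return pmf


def r_star(lam):
    """Marker value r* = ceil(log2(lam)), clamped at 0 for lam < 1."""
    return max(0, math.ceil(math.log2(lam)))


def huffman_lengths(pmf):
    """Return the Huffman codeword length of each symbol in pmf."""
    # Heap entries: (probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    tie = len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every symbol under the merged node gets one bit deeper
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths


def expected_length(pmf, lengths):
    """Expected codeword length in bits per register."""
    return sum(p * l for p, l in zip(pmf, lengths))


def entropy(pmf):
    """Shannon entropy of pmf in bits, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in pmf if p > 0.0)


def count_r_star_changes(m, n_max):
    """Count how often r* changes while n grows from m to n_max.

    Assumption: a change of r* stands in for a codebook rebuild.
    """
    changes = 0
    prev = r_star(1.0)
    for n in range(m + 1, n_max + 1):
        cur = r_star(n / m)
        if cur != prev:
            changes += 1
            prev = cur
    return changes


if __name__ == "__main__":
    lam = 1000.0
    pmf = register_pmf(lam)
    lengths = huffman_lengths(pmf)
    print("r* =", r_star(lam))
    print("entropy (bits/register):", entropy(pmf))
    print("Huffman (bits/register):", expected_length(pmf, lengths))
    print("r* changes for m=64, n up to 64*2**10:", count_r_star_changes(64, 64 * 2 ** 10))
```

Under these assumptions the mass sits within a few values of $r^*$, so the expected Huffman codeword length stays a small constant per register regardless of $\lambda$, and $r^*$ moves only when $n$ crosses a power-of-two multiple of $m$.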

Paper Structure (39 sections, 46 theorems, 34 equations, 2 figures, 1 table, 1 algorithm)

Key Result

Theorem 1

The total size of the Huffman-Bucket Sketch data structure is $O(m+\log n)$ bits, which is optimal [AlonMS:1999, IndykW:2003, KaneNW:2010, Woodruff:2004].
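
As a rough point of comparison, assume the usual fixed-width register layout for plain HLL (an illustrative assumption; the paper's exact baseline may differ). Each register must then be able to hold a rank as large as about $\log_2 n$, which costs $\Theta(\log\log n)$ bits per register. Writing $S_{\mathrm{HLL}}$ and $S_{\mathrm{HBS}}$ (notation introduced here, not in the paper) for the two sizes in bits:

$$
S_{\mathrm{HLL}} = O(m \log\log n), \qquad S_{\mathrm{HBS}} = O(m + \log n).
$$

The saving therefore comes from replacing the $\log\log n$ factor per register with a constant, at the price of a single additive $\log n$ term.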

Figures (2)

  • Figure 1: Register value distribution for various $\lambda$, with the marker representing $r^*=\lceil\log_2\lambda\rceil$.
  • Figure 2: (a) The bit size of the bucket array, or total codeword size $L$, as a function of the number of registers in the bucket $B$. The solid line represents the mean of one million repetitions, and the dashed lines the minimum and maximum sizes. (b) The size of a bucket with a fixed number of registers $B$ as a function of the load factor $\lambda$ over one million random sketches. The marker represents the mean, and the error bars represent the minimum and maximum sizes.

Theorems & Definitions (92)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Proposition 3
  • proof
  • Proposition 3
  • proof
  • Proposition 3
  • proof
  • ...and 82 more