Huffman-Bucket Sketch: A Simple $O(m)$ Algorithm for Cardinality Estimation

Matti Karppa

TL;DR

The Huffman-Bucket Sketch is a simple, mergeable data structure that losslessly compresses a HyperLogLog sketch with $m$ registers to optimal space, $O(m+\log n)$ bits. The Huffman tree provably needs rebuilding only $O(\log n)$ times over a stream, roughly each time the cardinality doubles.

Abstract

We introduce the Huffman-Bucket Sketch (HBS), a simple, mergeable data structure that losslessly compresses a HyperLogLog (HLL) sketch with $m$ registers to optimal space $O(m+\log n)$ bits, with amortized constant-time updates. It acts as a drop-in replacement for HLL that retains mergeability and substantially reduces memory requirements. We partition registers into small buckets and encode their values with a global Huffman codebook derived from the strongly concentrated HLL rank distribution, using the current cardinality estimate to determine the mode of the distribution. We prove that the Huffman tree needs rebuilding only $O(\log n)$ times over a stream, roughly when cardinality doubles. The framework can be extended to other sketches with similarly strongly concentrated distributions. Preliminary numerical evidence suggests that HBS is practical and potentially competitive with the state of the art.
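To illustrate why a Huffman codebook over register values can be short, here is a minimal Python sketch. It assumes the standard Poissonized model, in which a register with load $\lambda = n/m$ satisfies $\Pr[R \le r] = e^{-\lambda 2^{-r}}$; the paper's exact definitions may differ. The names `register_pmf` and `huffman_lengths` and the truncation bound `r_max` are hypothetical, and the bucketing and codebook-rebuild logic of HBS are not modeled.

```python
import heapq
import math


def register_pmf(lam, r_max=32):
    """Approximate P(R = r) for one HLL register with load lam = n/m.

    Assumption: Poissonized model with P(R <= r) = exp(-lam * 2**-r),
    where R = 0 means the register is still empty.
    """
    def cdf(r):
        return math.exp(-lam * 2.0 ** (-r))

    pmf = {0: cdf(0)}
    for r in range(1, r_max):
        pmf[r] = cdf(r) - cdf(r - 1)
    pmf[r_max] = 1.0 - cdf(r_max - 1)  # lump the tail into the last symbol
    # Drop symbols whose probability underflows to zero.
    return {r: p for r, p in pmf.items() if p > 0.0}


def huffman_lengths(pmf):
    """Return the Huffman codeword length of every symbol in pmf."""
    # Heap entries: (probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(pmf, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # every symbol below the new node gets one more bit
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


if __name__ == "__main__":
    lam = 1024.0
    pmf = register_pmf(lam)
    lengths = huffman_lengths(pmf)
    avg_bits = sum(pmf[r] * lengths[r] for r in pmf)
    entropy = -sum(p * math.log2(p) for p in pmf.values())
    print(f"entropy = {entropy:.3f} bits, Huffman average = {avg_bits:.3f} bits")
```

Because the distribution is concentrated on a handful of values near $\log_2\lambda$, the average codeword length stays at a few bits per register regardless of $\lambda$, which is the intuition behind an $O(m)$-bit register array.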

Paper Structure (39 sections, 46 theorems, 34 equations, 2 figures, 1 table, 1 algorithm)

Introduction
Related work
Preliminaries
Mathematical notation and preliminaries
Poissonized balls-and-bins model
HyperLogLog (HLL)
Huffman coding
Algorithm and data structure
Huffman-Bucket Sketch data structure
Operations on the Huffman-Bucket Sketch
Peek (lookup register value)
Poke (set register value)
Insert (update sketch with a new element)
Merge (combine two sketches into one sketch)
Analysis in the poissonized balls-and-bins model
...and 24 more sections

Key Result

Theorem 1

The total size of the Huffman-Bucket Sketch data structure is $O(m+\log n)$ bits, which is optimal [AlonMS:1999, IndykW:2003, KaneNW:2010, Woodruff:2004].

Figures (2)

Figure 1: Register value distribution for various $\lambda$, with the marker representing $r^*=\lceil\log_2\lambda\rceil$.
Figure 2: (a) The bit size of the bucket array, or total codeword size $L$, as a function of the number of registers in the bucket $B$. The solid line represents the mean of one million repetitions, and the dashed lines the minimum and maximum sizes. (b) The size of a bucket with a fixed number of registers $B$ as a function of the load factor $\lambda$ over one million random sketches. The marker represents the mean, and the error bars represent the minimum and maximum sizes.
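
As a rough check on the marker $r^*$ in Figure 1, the following derivation locates the peak of the register distribution. It assumes the standard Poissonized model $\Pr[R \le r] = e^{-\lambda 2^{-r}}$ and treats $r$ as continuous; the paper's own definitions may differ by an offset.

```latex
\Pr[R = r] = e^{-\lambda 2^{-r}} - e^{-\lambda 2^{-(r-1)}} = f\!\left(\lambda 2^{-r}\right),
\qquad f(x) = e^{-x} - e^{-2x},
\\
f'(x) = -e^{-x} + 2e^{-2x} = 0
\iff e^{x} = 2
\iff x = \ln 2
\;\Longrightarrow\;
r \approx \log_2 \lambda - \log_2 \ln 2 \approx \log_2 \lambda + 0.53 .
```

Under these assumptions the peak of the distribution lies within one of $\lceil\log_2\lambda\rceil$, consistent with using $r^*$ as a marker for the mode.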

Theorems & Definitions (92)

Theorem 1
proof
Theorem 2
proof
Proposition 3
proof
Proposition 3
proof
Proposition 3
proof
...and 82 more
