Hashing for Sampling-Based Estimation

Anders Aamand, Ioana O. Bercea, Jakob Bæk Tejs Houen, Jonas Klausen, Mikkel Thorup

TL;DR

This paper delivers strong, explicit concentration bounds for Tornado Tabulation hashing within the local uniformity framework, enabling two-sided concentration results for sampling-based estimation with realistic, constant-time hashing. By decomposing the analysis into local uniformity, layer-based concentration, and obstruction-based probability control, the authors achieve a practical, provable two-sided Chernoff-type bound with small additive error terms. The contributions translate into robust hash-based sampling tools—threshold, bottom-k, vector-k, and k-partition-min sketches—with provable behavior close to fully random hashing, across many sets and large-scale sketches. The results have direct implications for sampling accuracy, distinct-element counting, and similarity estimation in large data streams and databases, enabling reliable union-bound analyses and high-probability guarantees in practice.

Abstract

Hash-based sampling and estimation are common themes in computing. Using hashing for sampling gives us the coordination needed to compare samples from different sets. Hashing is also used when we want to count distinct elements. The quality of the estimator for, say, the Jaccard similarity between two sets depends on the concentration of the number of sampled elements from their intersection. Often we want to compare one query set against many stored sets to find one of the most similar sets, so we need strong concentration and low error probability. In this paper, we provide strong explicit concentration bounds for Tornado Tabulation hashing [Bercea, Beretta, Klausen, Houen, and Thorup, FOCS'23], which is a realistic constant-time hashing scheme. Previous concentration bounds for fast hashing were off by orders of magnitude in the sample size needed to guarantee the same concentration. The true power of our result appears when applied in the local uniformity framework of [Dahlgaard, Knudsen, Rotenberg, and Thorup, STOC'15].
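
To make the role of coordinated sampling concrete, here is a minimal Python sketch of a bottom-k sketch used for Jaccard-similarity and distinct-element estimation. It is an illustration under stated assumptions, not the paper's construction: the hash `h` is a seeded BLAKE2b digest standing in for Tornado Tabulation, and the names `h`, `bottom_k`, `jaccard_estimate`, and `distinct_estimate` are hypothetical helpers introduced only for this example.

```python
import hashlib


def h(x, seed=0):
    """Stand-in hash into [0, 1).

    Assumption: a seeded BLAKE2b digest, NOT Tornado Tabulation.
    """
    digest = hashlib.blake2b(
        str(x).encode(),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def bottom_k(items, k, seed=0):
    """Return the k smallest hash values of the distinct items, sorted."""
    return sorted({h(x, seed) for x in items})[:k]


def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches.

    Both sketches must use the same hash function (same seed); that shared
    hash is the coordination that makes the samples comparable.
    """
    # The k smallest values among both sketches form the bottom-k of A ∪ B.
    union_k = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_k:
        return 0.0
    in_both = set(sketch_a) & set(sketch_b)
    # Count sampled union elements that belong to the intersection.
    hits = sum(1 for v in union_k if v in in_both)
    return hits / len(union_k)


def distinct_estimate(sketch, k):
    """Estimate the number of distinct elements from a bottom-k sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct items: count is exact
    # The k-th smallest hash value sits near k / n for n distinct items.
    return (k - 1) / sketch[k - 1]


if __name__ == "__main__":
    k = 256
    a = bottom_k(range(0, 3000), k)
    b = bottom_k(range(1000, 4000), k)
    print("Jaccard estimate (true value 0.5):", jaccard_estimate(a, b, k))
    print("Distinct estimate for A (true value 3000):", distinct_estimate(a, k))
```

The accuracy of `jaccard_estimate` depends on how tightly the number of sampled intersection elements concentrates around its mean, which is the quantity the abstract refers to. With a fully random hash the error shrinks roughly like $1/\sqrt{k}$; how close a practical hash function comes to that behavior is the question the paper addresses.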

Paper Structure

This paper contains 64 sections, 45 theorems, 188 equations, 1 table.

Key Result

Theorem 1

For any $b \geq 1$ and $c \leq \ln s$, if $s \geq 2^{16} \cdot b^2$ and $\mu \in [s/4, s/2]$, then for any $\delta > 0$,

Theorems & Definitions (76)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 4
  • Lemma 5
  • proof
  • Claim 6
  • proof
  • Theorem 8
  • ...and 66 more