
ShockHash: Near Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

Hans-Peter Lehmann, Peter Sanders, Stefan Walzer

TL;DR

ShockHash (Small, heavily overloaded cuckoo hash tables for minimal perfect hashing) uses two hash functions $h_0$ and $h_1$ together with a 1-bit retrieval data structure that stores the choice function $f$ using $n + o(n)$ bits.

Abstract

A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n \log_2 e \approx 1.44n$ bits needed to represent an MPHF. This can be reached by a brute-force algorithm that tries $e^n$ hash function seeds in expectation and stores the first seed leading to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash (Small, heavily overloaded cuckoo hash tables for minimal perfect hashing). ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \to \{0, 1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. It then uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits. In graph terminology, ShockHash generates $n$-edge random graphs until stumbling on a pseudoforest, where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about $(e/2)^n \approx 1.359^n$ seeds in expectation. This reduces the space for storing the seed by roughly $n$ bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of $2^n$ compared to brute-force. Bipartite ShockHash reduces the expected construction time again to $1.166^n$ by maintaining a pool of candidate hash functions and checking all possible pairs. ShockHash as a building block within the RecSplit framework can be constructed up to 3 orders of magnitude faster than competing approaches. It can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour. When instead using ShockHash after an efficient k-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query.
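
The seed search described in the abstract can be illustrated with a minimal Python sketch for a tiny key set. It is not the authors' implementation: the helper names (`_hash`, `is_pseudoforest`, `orient`, `shockhash`, `query`) and the BLAKE2-based hash are assumptions made for illustration, and a plain dictionary stands in for the 1-bit retrieval data structure. Each key is treated as an edge between the cells $h_0(x)$ and $h_1(x)$, seeds are tried until the graph is a pseudoforest, and a cuckoo-style eviction walk then assigns every key to one of its two cells.

```python
import hashlib

def _hash(key: str, seed: int, which: int, n: int) -> int:
    """Illustrative seeded hash into [0, n); an assumption, not the paper's hash."""
    data = f"{seed}:{which}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n

def is_pseudoforest(edges, n):
    """Union-find test that no component has more edges than nodes."""
    parent = list(range(n))
    has_cycle = [False] * n  # per root: component already contains one cycle

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            if has_cycle[ra]:
                return False  # a second cycle in the same component
            has_cycle[ra] = True
        else:
            if has_cycle[ra] and has_cycle[rb]:
                return False  # merging two cyclic components
            parent[ra] = rb
            has_cycle[rb] = has_cycle[ra] or has_cycle[rb]
    return True

def orient(edges, n):
    """Cuckoo-style insertion: place each key in one of its two cells."""
    cell = [None] * n  # cell index -> key index
    for i in range(len(edges)):
        cur, pos = i, edges[i][0]
        for _ in range(4 * n + 4):  # generous step bound (an assumption)
            cur, cell[pos] = cell[pos], cur  # place cur, pick up the evicted key
            if cur is None:
                break
            a, b = edges[cur]
            pos = b if pos == a else a  # evicted key moves to its other cell
        else:
            return None
    return cell

def shockhash(keys, max_seeds=100_000):
    """Try seeds until the key graph is a pseudoforest; return (seed, f)."""
    n = len(keys)
    for seed in range(max_seeds):
        edges = [(_hash(k, seed, 0, n), _hash(k, seed, 1, n)) for k in keys]
        if not is_pseudoforest(edges, n):
            continue
        cell = orient(edges, n)
        if cell is None:
            continue
        # f(x) = 0 if x sits in cell h_0(x), else 1 (dict replaces the retrieval structure)
        f = {keys[i]: 0 if edges[i][0] == pos else 1 for pos, i in enumerate(cell)}
        return seed, f
    raise RuntimeError("no suitable seed found")

def query(key, seed, f, n):
    """Evaluate x -> h_{f(x)}(x)."""
    return _hash(key, seed, f[key], n)
```

For example, `seed, f = shockhash(keys)` followed by `query(k, seed, f, len(keys))` for each key should yield every index in `range(len(keys))` exactly once. The sketch is only practical for small `n`, in line with the exponential number of seeds stated in the abstract.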

Paper Structure

This paper contains 33 sections, 26 theorems, 11 equations, 15 figures, 1 table.

Key Result

Lemma 1

The probability for a seed to pass the filter, i.e. for every table cell to be hit by at least one key, is at most $(1-e^{-2}+o(1))^n ≈ 0.864^n$.
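
A rough Python sketch of such a filter follows, assuming it checks the union of the $h_0$ and $h_1$ values of all keys, as the wording of the lemma suggests. The names `_hash`, `passes_filter`, and `per_cell_hit_probability` are hypothetical, and the hash is an illustrative stand-in. The second function only evaluates the heuristic per-cell term behind the constant: a fixed cell is missed by all $2n$ hash values with probability $(1 - 1/n)^{2n} \to e^{-2}$. This is intuition for the bound, not the proof of the lemma.

```python
import hashlib
import math

def _hash(key: str, seed: int, which: int, n: int) -> int:
    """Illustrative seeded hash into [0, n); an assumption, not the paper's hash."""
    data = f"{seed}:{which}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n

def passes_filter(keys, seed) -> bool:
    """True if every one of the n cells is hit by h_0 or h_1 of some key."""
    n = len(keys)
    hit = 0  # bit i is set once cell i has been hit
    for k in keys:
        hit |= 1 << _hash(k, seed, 0, n)
        hit |= 1 << _hash(k, seed, 1, n)
    return hit == (1 << n) - 1

def per_cell_hit_probability(n: int) -> float:
    """Probability that one fixed cell is hit by at least one of 2n uniform hashes."""
    return 1.0 - (1.0 - 1.0 / n) ** (2 * n)

# The per-cell probability approaches 1 - e^{-2} as n grows.
limit = 1.0 - math.exp(-2.0)
```

A seed that fails this check cannot yield an MPHF, because some cell would stay empty, so the bitmask test can cheaply reject seeds before any more expensive pseudoforest check is run.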

Figures (15)

  • Figure 1: Illustrations of different pairing functions.
  • Figure 2: Illustration of the ShockHash construction. Functions $h_0$ and $h_1$ are randomly sampled hash functions using a seed $s$. Here, $s$ is a seed value where the resulting graph is a pseudotree. During construction, many seeds need to be tried.
  • Figure 3: ShockHash and bipartite ShockHash. The pseudocode illustrates the overall idea but does not lead to any performance improvements yet.
  • Figure 4: Pseudocode of bipartite ShockHash.
  • Figure 5: Illustration of the filtering involved in ShockHash and bipartite ShockHash. The construction is complete if we find one final seed. ShockHash determines both hash functions from the same seed. Bipartite ShockHash uses independent seeds for the two hash functions and filters the seeds before combining them.
  • ...and 10 more figures

Theorems & Definitions (26)

  • Lemma 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7: see Esposito et al. (2020), RecSplit
  • Theorem 8
  • Lemma 9
  • Theorem 10
  • ...and 16 more