Modern Minimal Perfect Hashing: A Survey

Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, Stefan Walzer

TL;DR

This survey addresses the problem of constructing minimal perfect hash functions (MPHFs), which map a known set of $n$ keys bijectively onto $[n]$, while using space close to the information-theoretic lower bound of $n\log_2 e$ bits. It categorizes modern MPHFs into retrieval-based, brute-force, and fingerprinting approaches, analyzes their space, construction time, and query performance, and provides an experimental comparison that scales to billions of keys. Key contributions include a detailed taxonomy, a clarified relationship to graph-theoretic notions such as peelability and orientability, and extensive benchmarks that guide practitioners in selecting the right MPHF for their needs. The work shows that state-of-the-art MPHFs can be extremely space-efficient (close to $\log_2 e \approx 1.443$ bits/key) and fast to query, with construction that scales through partitioning and parallelism, including GPU acceleration. This makes MPHFs practical for large-scale databases, bioinformatics, and text processing.

Abstract

Given a set $S$ of $n$ keys, a perfect hash function for $S$ maps the keys in $S$ to the first $m \geq n$ integers without collisions. It may return an arbitrary result for any key not in $S$ and is called minimal if $m = n$. The most important parameters are its space consumption, construction time, and query time. Years of research now enable modern perfect hash functions to be extremely fast to query, very space-efficient, and scale to billions of keys. Different approaches give different trade-offs between these aspects. For example, the smallest constructions get within 0.1% of the space lower bound of $\log_2(e)$ bits per key. Others are particularly fast to query, requiring only one memory access. Perfect hashing has many applications, for example to avoid collision resolution in static hash tables, and is used in databases, bioinformatics, and stringology. Since the last comprehensive survey in 1997, significant progress has been made. This survey covers the latest developments and provides a starting point for getting familiar with the topic. Additionally, our extensive experimental evaluation can serve as a guide to select a perfect hash function for use in applications.
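
As a rough numeric check of the $\log_2(e)$ bits per key quoted above, the following sketch evaluates the standard counting bound of $\log_2(n^n / n!)$ bits for a minimal perfect hash function, which applies when the key universe is much larger than $n$. The function name is ours and purely illustrative; it is not taken from the survey.

```python
import math


def mphf_lower_bound_bits_per_key(n: int) -> float:
    """Return log2(n^n / n!) / n, the large-universe counting bound per key.

    Assumption: the key universe is much larger than n, so the bound
    simplifies to log2(n^n / n!) bits in total.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Work in natural logarithms to avoid overflow: ln(n!) = lgamma(n + 1).
    total_bits = (n * math.log(n) - math.lgamma(n + 1)) / math.log(2)
    return total_bits / n


if __name__ == "__main__":
    print("log2(e) =", math.log2(math.e))
    for n in (10, 1_000, 1_000_000):
        print(n, mphf_lower_bound_bits_per_key(n))
```

For small $n$ the bound is noticeably below $\log_2(e)$; by Stirling's approximation the gap is about $\log_2(2\pi n)/(2n)$ bits per key and vanishes as $n$ grows.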

Paper Structure

This paper contains 101 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Perfect hashing approaches and how they influence each other. While we describe them in this paper, we do not evaluate the performance of approaches given in gray color because they are clearly larger than current ones or because they do not have a publicly available implementation.
  • Figure 2: Multiple Choice Hashing. Each key has the choice between several locations, in this case two. A retrieval data structure stores which of the choices to take.
  • Figure 3: Illustration of perfect hashing through bucket placement, where the $n=6$ keys are first mapped to buckets and then, for each bucket $i$, a seed value $v_i$ is computed to place the keys in the bucket without collisions to the output domain $[m=8]$. Here, $h$ has range $[m]$.
  • Figure 4: Relative sizes of different buckets when using various bucket mapping functions. The legend gives the original paper even if later techniques use the same functions. Adapted from Hermann et al. (2024), PHOBIC.
  • Figure 5: Illustration of the overall RecSplit data structure. Within each bucket, it constructs a splitting tree. Circular nodes represent splittings, squares represent bijections.
  • ...and 7 more figures
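
The bucket placement idea in the caption of Figure 3 can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation of any surveyed method: the hash function (BLAKE2b with the seed prepended), the uniform bucket mapping, the largest-bucket-first order, and all function names are hypothetical choices made for the example. Keys are assumed to be distinct strings.

```python
import hashlib


def _hash(key: str, seed: int) -> int:
    """Deterministic 64-bit hash of (seed, key); an illustrative choice."""
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def build_bucket_placement(keys, m, num_buckets, max_seed=1_000_000):
    """Search one seed per bucket so that all keys land in distinct slots of [m]."""
    # Step 1: map keys to buckets (seed 0 is reserved for this mapping).
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[_hash(key, 0) % num_buckets].append(key)

    seeds = [0] * num_buckets
    taken = [False] * m
    # Step 2: place the largest buckets first, while many slots are still free.
    order = sorted(range(num_buckets), key=lambda b: -len(buckets[b]))
    for i in order:
        for seed in range(1, max_seed):
            slots = {_hash(key, seed) % m for key in buckets[i]}
            # Accept the seed only if the bucket is collision-free internally
            # and does not hit a slot used by an earlier bucket.
            if len(slots) == len(buckets[i]) and not any(taken[s] for s in slots):
                for s in slots:
                    taken[s] = True
                seeds[i] = seed
                break
        else:
            raise RuntimeError(f"no seed found for bucket {i}")
    return seeds


def query(key, seeds, m):
    """Evaluate the perfect hash function: look up the bucket seed, then hash."""
    i = _hash(key, 0) % len(seeds)
    return _hash(key, seeds[i]) % m


if __name__ == "__main__":
    keys = ["apple", "banana", "cherry", "date", "elder", "fig"]
    seeds = build_bucket_placement(keys, m=8, num_buckets=3)
    print(sorted(query(k, seeds, 8) for k in keys))
```

Only the array of seeds needs to be stored, so the space usage depends on how compactly the seeds can be encoded. Setting `m` equal to the number of keys yields a minimal perfect hash function, at the cost of more seed trials for the last buckets.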