PHast -- Perfect Hashing made fast

Piotr Beling, Peter Sanders

TL;DR

PHast tackles the challenge of ultra-fast queries for perfect hash functions while keeping space near the information-theoretic minimum. It introduces a bucket-placement framework with fixed-width per-bucket seeds and a bumping mechanism, plus a PHast+ variant that uses additive placement for bit-parallel seed searching, yielding sub-2 bits per key and strong practical performance. Through extensive benchmarks, PHast and PHast+ demonstrate fast query evaluation and favorable space/construction-time trade-offs against state-of-the-art MPHFs, aided by cache-friendly layout and parallel construction. The work also outlines external-memory extensions and avenues for GPU acceleration and k-perfect hashing, making PHast a practical, scalable solution for large static datasets.

Abstract

Perfect hash functions give unique "names" to arbitrary keys while requiring only a few bits per key. They are an essential building block in applications such as static hash tables, databases, and bioinformatics. This paper introduces the PHast approach, which combines the fastest available queries, very fast construction, and good space consumption (below 2 bits per key). PHast improves on bucket placement, which first hashes each key k to a bucket and then looks for a bucket seed s such that a placement function maps the pairs (s,k) in a collision-free way. PHast can use small-range hash functions with linear mapping, fixed-width encoding of seeds, and parallel construction. This is achieved using small overlapping slices of allowed values and bumping to handle unsuccessful seed assignment. A variant we call PHast+ uses additive placement, which enables bit-parallel seed searching, speeding up construction by an order of magnitude.
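
To make the bucket-placement idea from the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `h`, `build`, and `query`, the BLAKE2b-based hashing, the parameters `bucket_size` and `seed_bits`, the use of seed 0 as a "bumped" marker, and the largest-bucket-first order are all hypothetical choices made here. It also places keys over the whole output range instead of small overlapping slices, and it only collects bumped keys in a list instead of handing them to a fallback structure.

```python
import hashlib

def h(key, seed=0):
    """Hypothetical seeded 64-bit hash (an assumption, not the paper's hash)."""
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

def build(keys, bucket_size=4, seed_bits=8):
    """Assign one fixed-width seed per bucket; bump buckets where no seed works."""
    n = len(keys)
    num_buckets = max(1, -(-n // bucket_size))  # ceiling division
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h(k) % num_buckets].append(k)

    used = [False] * n          # which output values are already taken
    seeds = [0] * num_buckets   # assumption: seed 0 marks a bumped bucket
    bumped = []
    # Simplification: process larger buckets first.
    for b in sorted(range(num_buckets), key=lambda i: -len(buckets[i])):
        for s in range(1, 1 << seed_bits):
            slots = [h(k, s) % n for k in buckets[b]]
            # The seed works if the bucket's keys hit distinct, free values.
            if len(set(slots)) == len(slots) and not any(used[x] for x in slots):
                for x in slots:
                    used[x] = True
                seeds[b] = s
                break
        else:
            bumped.extend(buckets[b])  # no seed fits in seed_bits: bump the bucket
    return seeds, bumped, num_buckets, n

def query(key, seeds, num_buckets, n):
    """Return the value of a placed key, or None if its bucket was bumped."""
    s = seeds[h(key) % num_buckets]
    return None if s == 0 else h(key, s) % n
```

A query reads a single seed and evaluates one hash, which is the property that makes this family of schemes fast to evaluate. The fixed seed width is what forces bumping: a bucket whose keys cannot be placed with any of the `2**seed_bits - 1` seeds is set aside instead of receiving a longer seed.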

Paper Structure

This paper contains 17 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 3.1: The seed is assigned to a bucket selected from a window of 5 buckets in the figure (and 256 in our actual implementation). Initially, the window includes buckets with low indexes (on the left in the figure) that cover low function values. Once the seed is assigned to the first (leftmost) bucket in the window, a shift towards higher values (right) is performed, so that the first bucket in the window has no seed (see bottom of figure). To check for collisions, the algorithm uses a cyclic fixed-size bitmap of used/free values from the range covered by the window.
  • Figure 3.2: Plots of $\ell$ functions for selected seed sizes $S$. The $\ell$ function is the size-dependent component of bucket selection priority. For buckets with more than $7$ keys, it grows linearly, with the same progression as from size $6$ to $7$.
  • Figure 3.3: PHast$^+$ tests the feasibility of $8$ ($64$ in our actual implementation) seeds ($s, \dots, s+7$) at once, for a bucket containing hash codes $c_0$, $c_1$, $c_2$.
  • Figure 4.1: Multithreaded seed assignment. Thanks to the gaps, no communication is required between threads filling separate chunks of the seed array.
  • Figure 5.1: The top plot shows how PHast size depends on bucket size ($\lambda$), for seed sizes ($S$) from $4$ to $12$. The black dots indicate the minimum sizes, which are also given in the table. The bottom plot illustrates the impact of $\lambda$ on construction and query times for $S=8$. To speed up queries, one can use a $\lambda$ somewhat lower than the size-minimizing value.
  • ...and 6 more figures
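
The bit-parallel seed test described in the Figure 3.3 caption can be sketched as follows. This is a hedged illustration, assuming that additive placement maps a hash code $c$ and seed $s$ to the value $c + s$, and that occupancy is tracked in a bitmap (here a Python integer `used`, where bit $p$ set means value $p$ is taken). The function name `feasible_seeds` and its parameters are hypothetical, and the real implementation works on 64-bit words over a cyclic bitmap, which this sketch does not model.

```python
def feasible_seeds(used, codes, s, width=8):
    """Test seeds s, s+1, ..., s+width-1 at once for one bucket.

    Returns a mask whose bit j is set iff seed s+j maps every code in
    `codes` to a free value. Assumes the codes within a bucket are distinct,
    so that adding the same seed to all of them cannot make them collide
    with each other.
    """
    mask = (1 << width) - 1
    occupied = 0
    for c in codes:
        # Bit j of this window says whether value c + s + j is already used.
        occupied |= (used >> (c + s)) & mask
    # A seed is feasible when none of the bucket's codes lands on a used value.
    return ~occupied & mask

# Values 1, 4, 6, and 9 are taken; the bucket holds hash codes 0, 2, 5.
used = (1 << 1) | (1 << 4) | (1 << 6) | (1 << 9)
print(bin(feasible_seeds(used, [0, 2, 5], s=0)))  # 0b101001: seeds 0, 3, 5 work
```

One shift, one mask, and one OR per key replace `width` separate collision checks, which is the kind of saving the caption alludes to when it mentions testing 64 seeds at once.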