Adaptive Hashing: Faster Hash Functions with Fewer Collisions

Gábor Melis

TL;DR

This work argues that fixing a hash function for the lifetime of a hash table is suboptimal and introduces online adaptive hashing, which tunes the function to the evolving key set with minimal overhead and no API changes. It formalizes a cost framework and uses rehashing as the mechanism to switch between hash variants (Constant, Arithmetic, Pointer-based mixes), guided by observed collisions and the maximum chain length. The paper demonstrates substantial gains for string and integer/pointer keys, including cases where the adaptive approach acts like a perfect hash while retaining robustness against worst-case scenarios. Empirical results from SBCL show practical improvements in both microbenchmarks and macrobenchmarks, with open-source implementations enabling reproducibility and further exploration of adaptive strategies.
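
The switching idea can be illustrated with a toy chained hash table. This is a minimal sketch under stated assumptions, not the SBCL implementation: the names `AdaptiveTable`, `weak_hash`, `strong_hash` and the threshold `MAX_CHAIN` are hypothetical, the weak hash is simply the integer key itself, and the robust fallback is a Murmur3-style 64-bit finalizer. The table starts with the cheap hash and, once an insertion makes a chain longer than the threshold, rehashes all entries with the robust one.

```python
from typing import Any, Callable, List, Tuple

MASK64 = 0xFFFFFFFFFFFFFFFF


def weak_hash(key: int) -> int:
    """Cheap hash (assumption for illustration): the integer key itself."""
    return key


def strong_hash(key: int) -> int:
    """Murmur3-style 64-bit finalizer, used here as the robust fallback."""
    key &= MASK64
    key ^= key >> 33
    key = (key * 0xFF51AFD7ED558CCD) & MASK64
    key ^= key >> 33
    key = (key * 0xC4CEB9FE1A85EC53) & MASK64
    key ^= key >> 33
    return key


class AdaptiveTable:
    """Toy chained hash table that changes its hash function on rehash."""

    MAX_CHAIN = 4  # hypothetical threshold, not taken from the paper

    def __init__(self, n_buckets: int = 8) -> None:
        # n_buckets must be a power of two because buckets are chosen by masking.
        self.hash_fn: Callable[[int], int] = weak_hash
        self.buckets: List[List[Tuple[int, Any]]] = [[] for _ in range(n_buckets)]
        self.count = 0

    def _index(self, key: int) -> int:
        return self.hash_fn(key) & (len(self.buckets) - 1)

    def get(self, key: int, default: Any = None) -> Any:
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key: int, value: Any) -> None:
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.count += 1
        if len(chain) > self.MAX_CHAIN and self.hash_fn is weak_hash:
            # The weak hash is bucketing badly: switch functions via a rehash.
            self._rehash(len(self.buckets), strong_hash)
        elif self.count > len(self.buckets):
            # Ordinary growth keeps whichever hash function is current.
            self._rehash(2 * len(self.buckets), self.hash_fn)

    def _rehash(self, n_buckets: int, hash_fn: Callable[[int], int]) -> None:
        entries = [entry for chain in self.buckets for entry in chain]
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(n_buckets)]
        for k, v in entries:
            self.buckets[self._index(k)].append((k, v))
```

With consecutive integer keys the weak hash never produces a long chain in this sketch, so the table keeps it. With keys that are all multiples of 1024, every key initially lands in the same bucket and the table switches to the robust hash after a handful of insertions. The `get`/`put` interface is the same in both cases, which mirrors the "no API changes" point above.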

Abstract

Hash tables are ubiquitous, and the choice of hash function, which maps a key to a bucket, is key to their performance. We argue that the predominant approach of fixing the hash function for the lifetime of the hash table is suboptimal and propose adapting it to the current set of keys. In the prevailing view, good hash functions spread the keys "randomly" and are fast to evaluate. General-purpose ones (e.g. Murmur) are designed to do both while remaining agnostic to the distribution of the keys, which limits their bucketing ability and wastes computation. When these shortcomings are recognized, one may specify a hash function more tailored to some assumed key distribution, but doing so almost always introduces an unbounded risk in case this assumption does not bear out in practice. At the other, fully key-aware end of the spectrum, Perfect Hashing algorithms can discover hash functions to bucket a given set of keys optimally, but they are costly to run and require the keys to be known and fixed ahead of time. Our main conceptual contribution is that adapting the hash table's hash function to the keys online is necessary for the best performance, as adaptivity allows for better bucketing of keys and faster hash functions. We instantiate the idea of online adaptation with minimal overhead and no change to the hash table API. The experiments show that the adaptive approach marries the common-case performance of weak hash functions with the robustness of general-purpose ones.

Paper Structure

This paper contains 33 sections, 8 theorems, 14 equations, 44 figures, 3 tables, 4 algorithms.

Key Result

Proposition 3

[Minimal Cost] Let $U(n,m)$ be the bucket count vector of any perfect hash of $n$ keys and $m$ buckets. Let $q=\lfloor n/m\rfloor$ and $r=n \bmod m$. Then, and this cost is minimal.
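
The quantities $q$ and $r$ suggest what the bucket count vector looks like. The sketch below assumes, as the use of $q$ and $r$ implies but this summary does not state, that a perfect hash spreads $n$ keys over $m$ buckets as evenly as possible, so that $r$ buckets hold $q+1$ keys and the remaining $m-r$ hold $q$. The function name `even_bucket_counts` is hypothetical, the order of the entries is arbitrary, and the cost itself is not computed because its formula is not reproduced here.

```python
from typing import List


def even_bucket_counts(n: int, m: int) -> List[int]:
    """Bucket counts when n keys are spread as evenly as possible over m buckets.

    Assumption: this is the shape of U(n, m) for a perfect hash.
    """
    if n < 0 or m <= 0:
        raise ValueError("need n >= 0 and m > 0")
    q, r = divmod(n, m)  # q = floor(n / m), r = n mod m
    return [q + 1] * r + [q] * (m - r)
```

For example, `even_bucket_counts(10, 4)` gives two buckets with 3 keys and two with 2, and the counts always sum to `n`.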

Figures (44)

  • Figure 1: Regret (Definition 4) with string keys. Adaptive does not gain or significantly compromise on regret. Points where the truncation limit changes vary between runs.
  • Figure 2: PUT timings in nanoseconds with string keys. Note the log scales. The plot shows the average time for inserting a new key when populating an empty hash table with a given number of keys.
  • Figure 3: GET timings with string keys.
  • Figure 4: Regret with FIXNUM:PROG 1. Murmur closely tracks Uniform. Prefuzz is aggressively optimized for small sizes. Adaptive (alg. rehash-eq) is a perfect hash here. Both Co+Pr (Constant followed by Prefuzz) and Adaptive use the Constant hash until the fixed switch point at 32 keys (black dot).
  • Figure 5: PUT timings with FIXNUM:PROG 1. Prefuzz outperforms Murmur even at large sizes despite higher regret because it's friendlier to the cache (its collisions are between subsequent elements of the progression), and its combination with Constant is even faster. Thus, despite being a perfect hash, Adaptive can improve on them only marginally.
  • ...and 39 more figures

Theorems & Definitions (16)

  • Definition 1: Bucket Count
  • Definition 2: Cost of Hashes
  • Definition 3: Perfect Hash
  • Proposition 3
  • Definition 4: Regret of Hashes
  • Proposition 4
  • Proposition 4
  • Proposition 4
  • Proposition 4
  • proof
  • ...and 6 more