Simplified Tight Bounds for Monotone Minimal Perfect Hashing

Dmitry Kosolobov

TL;DR

This work resolves the space complexity of monotone minimal perfect hash functions (MMPHFs) across the practical parameter range by establishing a tight lower bound of $\Omega\bigl(n \,\min\{\log\log\log \frac{u}{n}, \log n\}\bigr)$ bits for $u \ge (1+\varepsilon)n$, and showing that this bound is achievable via a straightforward extension of Belazzougui et al.'s construction. The author simplifies Assadi et al.'s proof by removing heavy combinatorial machinery and relying primarily on probabilistic coloring arguments, complemented by a detailed core component that follows the arguments of Assadi et al. and is not novel. The paper also provides a reduction to very large universe sizes to extend the bound across a broad range of $u$, and obtains matching upper bounds via a bucketed, concatenated-structure construction that achieves the same space bound in the tight regime. The results virtually settle the MMPHF space-usage problem for all reasonable $u$, and the approach offers a clearer probabilistic perspective on the interplay between colorings and data-structure encodings, with potential applicability to related hashing problems.

Abstract

Given an increasing sequence of integers $x_1,\ldots,x_n$ from a universe $\{0,\ldots,u-1\}$, the monotone minimal perfect hash function (MMPHF) for this sequence is a data structure that answers the following rank queries: $rank(x) = i$ if $x = x_i$, for $i\in \{1,\ldots,n\}$, and $rank(x)$ is arbitrary otherwise. Assadi, Farach-Colton, and Kuszmaul recently presented at SODA'23 a proof of the lower bound $\Omega(n \min\{\log\log\log u, \log n\})$ for the bits of space required by MMPHF, provided $u \ge n 2^{2^{\sqrt{\log\log n}}}$, which is tight since there is a data structure for MMPHF that attains this space bound (and answers the queries in $O(\log u)$ time). In this paper, we close the remaining gap by proving that, for $u \ge (1+\varepsilon)n$, where $\varepsilon > 0$ is any constant, the tight lower bound is $\Omega(n \min\{\log\log\log \frac{u}{n}, \log n\})$, which is also attainable; we observe that, for all reasonable cases when $n < u < (1+\varepsilon)n$, known facts imply tight bounds, which virtually settles the problem. Along the way we substantially simplify the proof of Assadi et al., replacing a part of their heavy combinatorial machinery by trivial observations. However, an important part of the proof still remains complicated. This part of our paper repeats arguments of Assadi et al. and is not novel. Nevertheless, we include it, for completeness, offering a somewhat different perspective on these arguments.
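
To make the query contract and the bound expression concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `NaiveMMPHF` and `bound_expression` are hypothetical names introduced here, the class stores every key explicitly (so it uses far more space than the bounds above and is not the construction of Belazzougui et al.), and the function evaluates $n \min\{\log\log\log \frac{u}{n}, \log n\}$ with base-2 logarithms while ignoring the constant hidden by the $\Omega$-notation.

```python
import math
from typing import Sequence


class NaiveMMPHF:
    """Reference-only illustration of the MMPHF query contract.

    Assumption: this stores the whole key set in a dictionary, so it is
    NOT space-efficient and NOT the data structure from the paper.
    """

    def __init__(self, keys: Sequence[int], u: int) -> None:
        if any(a >= b for a, b in zip(keys, keys[1:])):
            raise ValueError("keys must be strictly increasing")
        if keys and not (0 <= keys[0] and keys[-1] < u):
            raise ValueError("keys must lie in {0, ..., u-1}")
        # Ranks are 1-based, matching rank(x_i) = i in the abstract.
        self._rank = {x: i for i, x in enumerate(keys, start=1)}

    def rank(self, x: int) -> int:
        # For x outside the key set any answer is allowed; we pick 0.
        return self._rank.get(x, 0)


def bound_expression(n: int, u: int) -> float:
    """Evaluate n * min(log log log(u/n), log n) using base-2 logs.

    Constant factors hidden by the Omega-notation are ignored, so the
    result only indicates how the expression scales.
    """
    ratio = u / n
    if n < 1 or ratio <= 2:
        # The triple logarithm is undefined unless log log (u/n) > 0.
        raise ValueError("need n >= 1 and u/n > 2")
    triple_log = math.log2(math.log2(math.log2(ratio)))
    return n * min(triple_log, math.log2(n))
```

For instance, `NaiveMMPHF([3, 7, 20], u=32).rank(7)` returns 2, while `rank(5)` returns the arbitrary placeholder 0 because 5 is not in the key set. A real MMPHF would answer the same queries on the keys without storing them explicitly, which is what makes the sub-logarithmic space per key possible.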

Paper Structure (12 sections, 3 equations, 4 figures)

Figures (4)

  • Figure 1: A schematic image of the first intervals $[\ell_1..\ell'_1)$, $[\ell_2..\ell'_2)$, $[\ell_3..\ell'_3)$, the first blocks $[b_1..b'_1)$, $[b_2..b'_2)$, $[b_3..b'_3)$, and the first elements $x_1, x_2$ generated by our process. The set $[0..u)$ is depicted as the line at the bottom. The left vertical "ruler" depicts some levels (not all): the larger divisions denote the levels that could be chosen as $\ell_2$ and the smaller divisions could be chosen as $\ell_3$. The intervals $[\ell_2..\ell'_2)$ and $[\ell_3..\ell'_3)$ are painted in two shades of gray. For $i \in [1..3]$, each block $[b_i..b'_i)$ is associated with a rectangle that includes all subblocks of $[b_i..b'_i)$ from levels $[\ell_i..\ell'_i)$; the rectangles are painted in shades of blue; we depict inside the rectangle of $[b_i..b'_i)$ lines corresponding to levels that could be chosen as $\ell_{i+1}$ and we outline contours of blocks from the level $\ell_{i+1}$. The elements $x_1, x_2$ are chosen from $[0..u)$ but it is convenient to draw them also on the lines corresponding to the respective levels $\ell_2$ and $\ell_3$, so it is easier to see that the first level-$\ell_2$ block to the right of $x_1$ is $[b_2..b'_2)$ and the first level-$\ell_3$ block to the right of $x_2$ is $[b_3..b'_3)$.
  • Figure 2: A schematic partition of all blocks into disjoint subsets for a fixed $i \in [1..n)$.
  • Figure 3: The lines depict consecutive levels $[\lambda_k-1..\lambda_{k+1}]$ inside a block $[b_i..b'_i)$; we assume that $[\ell_{i+1}..\ell'_{i+1}) = [\lambda_k..\lambda_{k+1})$. The red regions denote the dense sets $D_{\ell}$, for $\ell \in [\lambda_k-1..\lambda_{k+1})$. The image is supposed to show the case when each such $D_{\ell}$ takes a large portion of $D_{\lambda_k-1}$, so that $D_{\lambda_k-1}$ might (approximately) serve as our "inherently dense" set $\bar{D}_k$ for level $\lambda_k$ in the block.
  • Figure 4: The lines depict consecutive levels $[\lambda_k-1..\lambda_{k+1}]$ inside a block $[b_i..b'_i)$. The level $\lambda_k$ is emphasized by the blue color. The red region under the line representing level $\ell$ depicts $D_{\ell}$. The set $D_{\lambda_k-1}$ consists of three maximal intervals; accordingly, $\bar{D}_k$ is drawn as three thick red lines over $D_{\lambda_k-1}$ (the gap to the right of each line represents the lacking rightmost block from level $\lambda_k$).

Theorems & Definitions (3)

  • Example 1
  • Remark 2
  • Remark 3