Simplified Tight Bounds for Monotone Minimal Perfect Hashing

Dmitry Kosolobov

TL;DR

This work resolves the space complexity of monotone minimal perfect hash functions (MMPHFs) across the practical parameter range by establishing a tight lower bound of $\Omega\bigl(n \,\min\{\log\log\log \frac{u}{n}, \log n\}\bigr)$ bits for $u \ge (1+\varepsilon)n$, where $\varepsilon > 0$ is any constant, and showing that this bound is achievable via a straightforward extension of Belazzougui et al.'s construction. The paper simplifies Assadi et al.'s proof by replacing part of its heavy combinatorial machinery with simple observations and relies primarily on probabilistic coloring arguments; the core component that remains complicated is presented in detail but is not novel and follows the earlier arguments. It also provides a reduction from arbitrary $u$ to very large universe sizes, which extends the bound across a broad range of $u$, and obtains tight upper bounds via a bucketed construction of concatenated structures that achieves the same space bound. The results virtually settle the MMPHF space-usage problem for all reasonable $u$, and the approach offers a clearer probabilistic perspective on the interplay between colorings and data-structure encodings, with potential applicability to related hashing problems.
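To get a feel for how slowly the bound grows, the expression inside the $\Omega(\cdot)$ can be evaluated numerically. The sketch below is only an illustration: the function name `mmphf_bound_bits` is hypothetical, logarithms are taken base 2, the hidden constant is ignored, and the nested logarithms are clamped from below (an assumption made here so that the expression stays defined when $u/n$ is small).

```python
import math


def mmphf_bound_bits(n: int, u: int) -> float:
    """Evaluate n * min(log log log(u/n), log n) with base-2 logarithms.

    Illustrative only: the constant hidden by the Omega-notation is ignored,
    and the clamping below is an assumption, not part of the stated bound.
    """
    if n < 2 or u <= n:
        raise ValueError("expects n >= 2 and u > n")
    # log2(u/n), computed as a difference so that very large u does not overflow.
    inner = max(2.0, math.log2(u) - math.log2(n))
    # inner >= 2, so the second logarithm is at least 1.
    middle = math.log2(inner)
    # Clamp the triple logarithm at 1 so the result stays positive.
    triple = max(1.0, math.log2(middle))
    return n * min(triple, math.log2(n))


# With u/n = 2^256 the triple logarithm is exactly 3, far below log2(n) = 10.
print(mmphf_bound_bits(1024, 1024 * 2**256))
```

Even for a ratio $u/n$ as large as $2^{256}$, the triple logarithm contributes only a factor of 3, which is why the $\log n$ term matters only for astronomically large universes.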

Abstract

Given an increasing sequence of integers $x_1,\ldots,x_n$ from a universe $\{0,\ldots,u-1\}$, the monotone minimal perfect hash function (MMPHF) for this sequence is a data structure that answers the following rank queries: $rank(x) = i$ if $x = x_i$, for $i\in \{1,\ldots,n\}$, and $rank(x)$ is arbitrary otherwise. Assadi, Farach-Colton, and Kuszmaul recently presented at SODA'23 a proof of the lower bound $\Omega(n \min\{\log\log\log u, \log n\})$ on the number of bits of space required by an MMPHF, provided $u \ge n 2^{2^{\sqrt{\log\log n}}}$, which is tight since there is a data structure for MMPHF that attains this space bound (and answers the queries in $O(\log u)$ time). In this paper, we close the remaining gap by proving that, for $u \ge (1+\varepsilon)n$, where $\varepsilon > 0$ is any constant, the tight lower bound is $\Omega(n \min\{\log\log\log \frac{u}{n}, \log n\})$, which is also attainable; we observe that, for all reasonable cases when $n < u < (1+\varepsilon)n$, known facts imply tight bounds, which virtually settles the problem. Along the way, we substantially simplify the proof of Assadi et al., replacing a part of their heavy combinatorial machinery with trivial observations. However, an important part of the proof still remains complicated. This part of our paper repeats arguments of Assadi et al. and is not novel. Nevertheless, we include it for completeness, offering a somewhat different perspective on these arguments.
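The query contract in the definition above can be made concrete with a naive baseline. The following sketch is not the construction discussed in the paper: the class name `NaiveMMPHF` is hypothetical, and it stores all keys explicitly, which costs far more space than the bounds above. It only illustrates the semantics: keys get their 1-based rank, and non-keys may get any answer.

```python
from bisect import bisect_left


class NaiveMMPHF:
    """Reference semantics for MMPHF rank queries (illustrative baseline).

    Stores the keys explicitly, so it does not exploit the freedom to
    answer arbitrarily on non-keys that a real MMPHF uses to save space.
    """

    def __init__(self, keys):
        keys = list(keys)
        # The definition requires a strictly increasing sequence x_1 < ... < x_n.
        if any(a >= b for a, b in zip(keys, keys[1:])):
            raise ValueError("keys must be strictly increasing")
        self._keys = keys

    def rank(self, x: int) -> int:
        # For x = x_i this returns i (1-based). For any other x the returned
        # value is unspecified by the definition; here it happens to be the
        # insertion position plus one.
        return bisect_left(self._keys, x) + 1


keys = [3, 8, 20, 41]
h = NaiveMMPHF(keys)
print([h.rank(k) for k in keys])  # ranks of the stored keys
```

A space-efficient MMPHF must return the same values on the stored keys, but it is free to return anything on inputs such as `5` or `100`, and that freedom is what allows the structure to use far fewer bits than storing the keys.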

Paper Structure (12 sections, 3 equations, 4 figures)

Introduction
Tight Upper Bounds
From Data Structures to Colorings
Coloring of Random Sequences
Reduction of arbitrary $u$ to very large $u$.
Random Sequences on Large Universes
Definition of the random process
Analysis of the random process
Analysis plan.
Constructing $\bar{S}$ and $\bar{D}$.
Probability to end up in an abnormal block.
Probability of correct coloring.

Figures (4)

Figure 1: A schematic image of the first intervals $[\ell_1..\ell'_1)$, $[\ell_2..\ell'_2)$, $[\ell_3..\ell'_3)$, the first blocks $[b_1..b'_1)$, $[b_2..b'_2)$, $[b_3..b'_3)$, and the first elements $x_1, x_2$ generated by our process. The set $[0..u)$ is depicted as the line at the bottom. The left vertical "ruler" depicts some levels (not all): the larger divisions denote the levels that could be chosen as $\ell_2$ and the smaller divisions could be chosen as $\ell_3$. The intervals $[\ell_2..\ell'_2)$ and $[\ell_3..\ell'_3)$ are painted in two shades of gray. For $i \in [1..3]$, each block $[b_i..b'_i)$ is associated with a rectangle that includes all subblocks of $[b_i..b'_i)$ from levels $[\ell_i..\ell'_i)$; the rectangles are painted in shades of blue; we depict inside the rectangle of $[b_i..b'_i)$ lines corresponding to levels that could be chosen as $\ell_{i+1}$ and we outline contours of blocks from the level $\ell_{i+1}$. The elements $x_1, x_2$ are chosen from $[0..u)$ but it is convenient to draw them also on the lines corresponding to the respective levels $\ell_2$ and $\ell_3$, so it is easier to see that the first level-$\ell_2$ block to the right of $x_1$ is $[b_2..b'_2)$ and the first level-$\ell_3$ block to the right of $x_2$ is $[b_3..b'_3)$.
Figure 2: A schematic partition of all blocks into disjoint subsets for a fixed $i \in [1..n)$.
Figure 3: The lines depict consecutive levels $[\lambda_k-1..\lambda_{k+1}]$ inside a block $[b_i..b'_i)$; we assume that $[\ell_{i+1}..\ell'_{i+1}) = [\lambda_k..\lambda_{k+1})$. The red regions denote the dense sets $D_{\ell}$, for $\ell \in [\lambda_k-1..\lambda_{k+1})$. The image is supposed to show the case when each such $D_{\ell}$ takes a large portion of $D_{\lambda_k-1}$, so that $D_{\lambda_k-1}$ might (approximately) serve as our "inherently dense" set $\bar{D}_k$ for level $\lambda_k$ in the block.
Figure 4: The lines depict consecutive levels $[\lambda_k-1..\lambda_{k+1}]$ inside a block $[b_i..b'_i)$. The level $\lambda_k$ is emphasized by the blue color. The red region under the line representing level $\ell$ depicts $D_{\ell}$. The set $D_{\lambda_k-1}$ consists of three maximal intervals; accordingly, $\bar{D}_k$ is drawn as three thick red lines over $D_{\lambda_k-1}$ (the gap to the right of each line represents the lacking rightmost block from level $\lambda_k$).

Theorems & Definitions (3)

Example 1
Remark 2
Remark 3
