Fast Construction of Partitioned Learned Bloom Filter with Theoretical Guarantees

Atsuki Sato, Yusuke Matsui

TL;DR

Three methods are proposed (fast PLBF, fast PLBF++, and fast PLBF#) that reduce the construction complexity of PLBF from $O(N^3k)$ to $O(N^2k)$, $O(Nk \log N)$, and $O(Nk \log k)$, respectively. Fast PLBF++ and fast PLBF# are theoretically proven to be equivalent to PLBF under an ideal data distribution.

Abstract

The Bloom filter is a widely used classic data structure for approximate membership queries. Learned Bloom filters improve memory efficiency by leveraging machine learning, with the partitioned learned Bloom filter (PLBF) being among the most memory-efficient variants. However, PLBF suffers from high computational complexity during construction, specifically $O(N^3k)$, where $N$ and $k$ are hyperparameters. In this paper, we propose three methods (fast PLBF, fast PLBF++, and fast PLBF#) that reduce the construction complexity to $O(N^2k)$, $O(Nk \log N)$, and $O(Nk \log k)$, respectively. Fast PLBF preserves the original PLBF structure and memory efficiency. Although fast PLBF++ and fast PLBF# may have different structures, we theoretically prove they are equivalent to PLBF under an ideal data distribution. Furthermore, we theoretically bound the difference in memory efficiency between PLBF and fast PLBF++ for non-ideal scenarios. Experiments on real-world datasets demonstrate that fast PLBF, fast PLBF++, and fast PLBF# are up to 233, 761, and 778 times faster to construct than the original PLBF, respectively. Additionally, fast PLBF maintains the same data structure as PLBF, and fast PLBF++ and fast PLBF# achieve nearly identical memory efficiency.
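To get a feel for how far apart the four complexity classes are, the short sketch below evaluates the leading terms for one illustrative setting. The function name, the values $N = 1000$ and $k = 5$, and the choice of base-2 logarithms are assumptions made for this example only; big-O notation hides constant factors, so these figures are rough step counts and not predictions of the measured speedups reported above.

```python
import math

def construction_cost_estimates(n_segments, n_regions):
    """Leading-order step counts for each method (constant factors ignored)."""
    n, k = n_segments, n_regions
    return {
        "PLBF": n**3 * k,                    # O(N^3 k)
        "fast PLBF": n**2 * k,               # O(N^2 k)
        "fast PLBF++": n * k * math.log2(n), # O(N k log N), base 2 assumed
        "fast PLBF#": n * k * math.log2(k),  # O(N k log k), base 2 assumed
    }

# Illustrative hyperparameters, not taken from the paper's experiments.
estimates = construction_cost_estimates(1000, 5)
for name, cost in estimates.items():
    print(f"{name:12s} ~{cost:.3g} steps")
```

With these assumed values, the gap between PLBF and fast PLBF is exactly a factor of $N$, while the two later methods replace a further factor of $N$ with a logarithm.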

Paper Structure

This paper contains 23 sections, 4 theorems, 54 equations, 16 figures, 1 table, 3 algorithms.

Key Result

Theorem 5

There exists an optimal solution $\bm{f}$ to the optimization problem that satisfies $G_i/H_i \leq G_j/H_j$ for all $i, ~ j$ such that $f_i < 1, ~ f_j = 1$. (Here, when $H_i=0$, $G_i/H_i=\infty$, which is greater than any finite value.)
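The role of the ratio $G_i/H_i$ can be seen from a short stationarity argument. The sketch below assumes the optimization problem has the form commonly used for PLBF (minimize backup filter memory subject to a bound $F$ on the overall false positive rate); that formulation, the target $F$, and the multiplier $\mu$ are assumptions introduced here, since the problem statement itself is not reproduced in this summary.

```latex
% Assumed problem form (not reproduced from the paper):
\min_{\bm{f}} \; \sum_{i=1}^{k} G_i \log_2 \frac{1}{f_i}
\quad \text{s.t.} \quad \sum_{i=1}^{k} H_i f_i \le F, \qquad 0 < f_i \le 1 .

% Stationarity of the Lagrangian for a region whose cap f_i \le 1 is inactive:
-\frac{G_i}{f_i \ln 2} + \lambda H_i = 0
\;\;\Longrightarrow\;\;
f_i = \frac{G_i}{\mu H_i}, \qquad \mu := \lambda \ln 2 .

% Including the cap:
f_i = \min\!\left(1, \; \frac{G_i}{\mu H_i}\right),
\qquad\text{so}\qquad
f_i < 1 \Rightarrow \frac{G_i}{H_i} < \mu,
\qquad
f_j = 1 \Rightarrow \frac{G_j}{H_j} \ge \mu .
```

Under these assumptions, the regions capped at $f_j = 1$ are the ones with the largest ratio $G_j/H_j$, which is consistent with treating $G_i/H_i = \infty$ as the largest possible value when $H_i = 0$.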

Figures (16)

  • Figure 1: PLBF partitions the score space into $k$ regions and assigns backup Bloom filters with different FPRs to each region.
  • Figure 2: PLBF divides the score space into $N$ segments and then clusters the $N$ segments into $k$ regions. PLBF uses dynamic programming to find the optimal way to cluster segments into regions.
  • Figure 3: Example of a matrix problem for a monotone matrix.
  • Figure 4: Monotone maxima. (a) An exhaustive search on the middle row and (b) a narrowing of the search area are (c) repeated recursively.
  • Figure 5: When the number of regions is fixed at $q$ and the number of segments is increased by $1$, the optimal $t_{q-1}$ remains unchanged or increases if $A$ is a monotone matrix. When $A$ is not a monotone matrix, the optimal $t_{q-1}$ may decrease.
  • ...and 11 more figures
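
The clustering step described in the Figure 2 caption can be illustrated with a plain dynamic program over contiguous segments. This is a minimal sketch, not the authors' implementation: it assumes the objective is the KL divergence between the per-region key masses and non-key masses (the objective commonly used for PLBF), and the names `kl_term` and `best_partition` are hypothetical. It runs in $O(N^2 k)$ time for a single table.

```python
import math

def kl_term(g, h):
    """Contribution g * log2(g / h) of one region (assumed objective)."""
    if g <= 0:
        return 0.0
    if h <= 0:
        return math.inf
    return g * math.log2(g / h)

def best_partition(g, h, k):
    """Split N segments into k contiguous regions maximizing the summed kl_term.

    g[i], h[i]: key / non-key probability mass of segment i.
    Returns (best objective value, region boundaries as prefix indices).
    """
    n = len(g)
    # Prefix sums so that the mass of segments i..p-1 is G[p] - G[i].
    G, H = [0.0], [0.0]
    for a, b in zip(g, h):
        G.append(G[-1] + a)
        H.append(H[-1] + b)

    neg_inf = -math.inf
    dp = [[neg_inf] * (n + 1) for _ in range(k + 1)]   # dp[q][p]: first p segments, q regions
    cut = [[0] * (n + 1) for _ in range(k + 1)]        # start index of the last region
    dp[0][0] = 0.0
    for q in range(1, k + 1):
        for p in range(q, n + 1):
            for i in range(q - 1, p):
                if dp[q - 1][i] == neg_inf:
                    continue
                cand = dp[q - 1][i] + kl_term(G[p] - G[i], H[p] - H[i])
                if cand > dp[q][p]:
                    dp[q][p], cut[q][p] = cand, i

    # Walk the cut table backwards to recover the region boundaries.
    bounds, p = [n], n
    for q in range(k, 0, -1):
        p = cut[q][p]
        bounds.append(p)
    return dp[k][n], bounds[::-1]

# Toy data (made up): keys concentrate in the last two segments.
value, bounds = best_partition([0.1, 0.1, 0.4, 0.4], [0.4, 0.4, 0.1, 0.1], 2)
print(value, bounds)
```

On this toy input the boundaries `[0, 2, 4]` mean the first region covers segments 0 and 1 and the second covers segments 2 and 3. The innermost loop over `i` is the part that a monotone-matrix argument would aim to shrink.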

Theorems & Definitions (8)

  • Definition 2: matrix problem
  • Definition 3: monotone matrix
  • Definition 4: totally monotone matrix
  • Theorem 5
  • Definition 6
  • Lemma 7
  • Theorem 8
  • Theorem 9
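
The monotone maxima procedure summarized in the Figure 4 caption can be sketched as follows. The code assumes the matrix is monotone in the sense that the leftmost column attaining each row's maximum never moves left as the row index grows; the function name and the callback-style `value(i, j)` interface are choices made for this illustration and are not taken from the paper.

```python
def monotone_maxima(n_rows, n_cols, value):
    """Return, for each row, the leftmost column index of its maximum.

    value(i, j) gives the matrix entry A[i][j]. Correctness relies on the
    assumption that the matrix is monotone (row argmaxes are non-decreasing).
    """
    argmax = [0] * n_rows

    def solve(row_lo, row_hi, col_lo, col_hi):
        if row_lo > row_hi:
            return
        mid = (row_lo + row_hi) // 2
        # Exhaustive search on the middle row, restricted to the live columns.
        best = col_lo
        for col in range(col_lo, col_hi + 1):
            if value(mid, col) > value(mid, best):
                best = col
        argmax[mid] = best
        # Narrow the search area: rows above can only use columns <= best,
        # rows below can only use columns >= best.
        solve(row_lo, mid - 1, col_lo, best)
        solve(mid + 1, row_hi, best, col_hi)

    solve(0, n_rows - 1, 0, n_cols - 1)
    return argmax

# Toy monotone matrix (made up): row i peaks at column i.
result = monotone_maxima(4, 5, lambda i, j: -(j - i) ** 2)
print(result)
```

Because each recursion level scans every live column at most a constant number of times, the total work grows roughly with the number of columns times the logarithm of the number of rows, plus the number of rows, instead of with their product.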