Cascaded Learned Bloom Filter for Optimal Model-Filter Size Balance and Fast Rejection

Atsuki Sato, Yusuke Matsui

TL;DR

The paper tackles efficient approximate membership querying by addressing two weaknesses of learned Bloom filters: suboptimal balance between the machine learning model size and Bloom-filter size, and non-ideal reject times. It introduces the Cascaded Learned Bloom Filter (CLBF), a cascaded architecture that alternates between score-based branching from multiple ML stages and Bloom-filter filtering, optimized via dynamic programming to minimize a weighted combination of memory and expected reject time under a target false-positive rate $F$. The authors formulate precise memory and latency objectives, define a tractable DP routine with complexity $\mathcal{O}(\bar{D}P^2 + \bar{D}PK)$, and discretize key parameters to enable practical optimization. Empirical results on Malicious URLs and EMBER datasets show that CLBF reduces memory usage by up to 24% and reduces reject time by up to 14x compared with PLBF, demonstrating improved memory efficiency and faster rejections suitable for latency-sensitive, memory-constrained applications.

Abstract

Recent studies have demonstrated that learned Bloom filters, which combine machine learning with the classical Bloom filter, can achieve superior memory efficiency. However, existing learned Bloom filters face two critical unresolved challenges: the balance between the machine learning model size and the Bloom filter size is not optimal, and the reject time cannot be minimized effectively. We propose the Cascaded Learned Bloom Filter (CLBF) to address these issues. Our dynamic programming-based optimization automatically selects configurations that achieve an optimal balance between the model and filter sizes while minimizing reject time. Experiments on real-world datasets show that CLBF reduces memory usage by up to 24% and decreases reject time by up to 14 times compared to state-of-the-art learned Bloom filters.
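
The cascaded query path described in the TL;DR (Bloom-filter filtering alternating with score-based branching) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class names `BloomFilter`, `Stage`, and `CascadedFilter`, the toy score functions, and the fixed thresholds and filter sizes are all assumptions made for illustration. The sketch also simplifies the design: it places a filter before every stage, uses a single final backup filter, and does not perform the dynamic-programming optimization that the paper uses to choose the configuration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterator, List


class BloomFilter:
    """Classical Bloom filter; one byte per bit for readability."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k positions by salting SHA-256 with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


@dataclass
class Stage:
    score: Callable[[str], float]  # hypothetical per-stage model score
    threshold: float               # scores >= threshold branch off here
    trunk: BloomFilter             # filter applied before this stage's model
    branch: BloomFilter            # filter for items that branch off here


class CascadedFilter:
    def __init__(self, stages: List[Stage], final: BloomFilter) -> None:
        self.stages = stages
        self.final = final  # backup filter for items that never branch

    def add(self, key: str) -> None:
        # Insert the key into every filter it will meet at query time,
        # which rules out false negatives.
        for stage in self.stages:
            stage.trunk.add(key)
            if stage.score(key) >= stage.threshold:
                stage.branch.add(key)
                return
        self.final.add(key)

    def query(self, item: str) -> bool:
        for stage in self.stages:
            if item not in stage.trunk:
                return False  # fast rejection: later models never run
            if stage.score(item) >= stage.threshold:
                return item in stage.branch
        return item in self.final


def make_filter() -> BloomFilter:
    return BloomFilter(num_bits=1024, num_hashes=3)


# Toy scores standing in for trained models (assumption, not from the paper).
def cheap_score(s: str) -> float:
    return 1.0 if "login" in s else 0.0


def costly_score(s: str) -> float:
    return min(1.0, s.count("-") / 3)


clbf = CascadedFilter(
    stages=[Stage(cheap_score, 0.5, make_filter(), make_filter()),
            Stage(costly_score, 0.5, make_filter(), make_filter())],
    final=make_filter(),
)
keys = ["bad-login.example", "a-b-c.example", "plain.example"]
for key in keys:
    clbf.add(key)
print(all(clbf.query(key) for key in keys))  # True: keys are never rejected
```

The three example keys exercise the three exits of the cascade: the first branches at stage one, the second at stage two, and the third falls through to the final filter. A non-key that fails the first filter is rejected before any score is computed, which is the mechanism behind the shorter reject times; how many stages to use and how to size each filter is what the paper's optimization decides.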

Paper Structure

This paper contains 27 sections, 13 equations, 17 figures, 2 algorithms.

Figures (17)

  • Figure 1: The architecture of existing LBFs: (a) Naive LBF (Kraska et al., 2018) has a single backup Bloom filter. (b) Sandwiched LBF (Mitzenmacher, 2018) applies a pre-filter before the model inference. (c) PLBF (Vaidya et al., 2021) uses multiple backup Bloom filters.
  • Figure 2: The architecture of CLBF: CLBF alternates between score-based branching and Bloom filter-based filtering. This design generalizes the architectures of sandwiched LBF and PLBF. Note that $g^{(*)}_{*}$ and $h^{(*)}_{*}$ represent the proportions of keys and non-keys passing through each route when filtering using $\mathrm{TBF}$s is not performed.
  • Figure 3: $\hat{\mathrm{dp}}(d, T)$ is the minimum objective function value under $\mathrm{TBF}_d$ subject to the constraint that $\prod_{j=1}^{d-1} f^{(t)}_j = T$ and $D = d$.
  • Figure 4: $\check{\mathrm{dp}}(d, T)$ is the minimum objective function value under $\mathrm{TBF}_d$ subject to the constraint that $\prod_{j=1}^{d-1} f^{(t)}_j = T$ and $D > d$.
  • Figure 6: Trade-off between memory usage and accuracy (lower-left is better): CLBF achieves memory efficiency equal to or better than that of PLBF with any $D$.
  • ...and 12 more figures