Daisy Bloom Filters

Ioana O. Bercea, Jakob Bæk Tejs Houen, Rasmus Pagh

TL;DR

This work studies optimal Bloom-filter design under distributional data and query models, aiming to minimize space while maintaining efficient membership operations. It introduces a distribution-aware lower bound $LB(\mathcal{P}_n,\mathcal{Q},\varepsilon)$ and shows it tightly characterizes the space needed by any $(\mathcal{Q},\varepsilon)$-filter for inputs drawn from $\mathcal{P}_n$. Building on this, the authors present a space-efficient filter that matches the lower bound up to additive terms in $n$ and achieves worst-case constant-time operations, and they introduce the Daisy Bloom filter, which further reduces space to $\log(e)\cdot LB(\mathcal{P}_n,\mathcal{Q},\varepsilon) + O(n)$ bits with worst-case time $\lceil \log(1/\varepsilon)\rceil$ per operation. The approach partitions the universe by the ratio $q_x/p_x$, leverages multiple standard filters for frequent partitions, and employs concentration results (Bernstein) to guarantee high-probability performance over input sets drawn from $\mathcal{P}_n$. Overall, the paper advances the design of space-efficient, distribution-aware Bloom filters with provable optimality and practical worst-case performance guarantees under realistic distributional assumptions.
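The idea of varying the work per element with the ratio $q_x/p_x$ can be illustrated with a small sketch. The rule used here is an assumption made for illustration and is not the paper's definition of $k_x$: it takes $k_x = \lceil \log_2(q_x/(\varepsilon p_x)) \rceil$, clamped to the range $[0, \lceil \log_2(1/\varepsilon) \rceil]$. Elements that are likely to be inserted but rarely queried then get zero hash functions and are always reported as present, and no element needs more than $\lceil \log_2(1/\varepsilon) \rceil$ probes. The names `hash_count` and `WeightedBloomSketch`, and the SHA-256-based hashing, are hypothetical.

```python
import hashlib
import math


def hash_count(p_x: float, q_x: float, eps: float) -> int:
    """Illustrative per-element hash count (an assumption, not the paper's k_x)."""
    k_max = math.ceil(math.log2(1.0 / eps))
    if q_x <= eps * p_x:
        # Rarely queried relative to how often it is inserted: store nothing.
        return 0
    return min(k_max, math.ceil(math.log2(q_x / (eps * p_x))))


class WeightedBloomSketch:
    """Bloom-style filter whose number of probes depends on q_x / p_x."""

    def __init__(self, num_bits: int, p: dict, q: dict, eps: float):
        self.bits = bytearray(num_bits)  # one byte per bit, for readability
        self.p, self.q, self.eps = p, q, eps

    def _positions(self, x):
        k = hash_count(self.p[x], self.q[x], self.eps)
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.bits)

    def insert(self, x) -> None:
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x) -> bool:
        # With zero probes this is vacuously True: x is always "in the set".
        return all(self.bits[pos] for pos in self._positions(x))
```

With toy probabilities such as `p = {"a": 0.5, "b": 0.001}` and `q = {"a": 0.001, "b": 0.5}`, element `"a"` costs no space and no probes, while `"b"` uses the maximum number of probes.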

Abstract

A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S'\supseteq S$ that is "close to $S$" in the sense that for $x\not\in S$, the probability that $x\in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well-understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
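For reference, the standard Bloom filter that serves as the baseline here can be sketched as follows. The sizing uses the textbook choices $m = \lceil n \log_2(e) \log_2(1/\varepsilon) \rceil$ bits and $k = \lceil \log_2(1/\varepsilon) \rceil$ hash functions. The class name `BloomFilter` and the SHA-256-based hashing are assumptions made for this sketch, not part of the paper.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter sized for n elements and target false positive rate eps."""

    def __init__(self, n: int, eps: float):
        self.k = max(1, math.ceil(math.log2(1.0 / eps)))
        self.m = max(1, math.ceil(n * math.log2(math.e) * math.log2(1.0 / eps)))
        self.bits = bytearray((self.m + 7) // 8)  # packed bit array

    def _positions(self, x):
        # Derive k positions from salted SHA-256 digests (illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, x) -> None:
        for pos in self._positions(x):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, x) -> bool:
        # True for every inserted element; may also be True for others
        # (a false positive), which is what eps bounds.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(x))
```

Every inserted element is reported as present, so the represented set $S'$ is always a superset of $S$. The only errors are false positives on elements outside $S$.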

Paper Structure (12 sections, 13 theorems, 13 equations, 1 figure)


Key Result

Theorem 2

Let $A$ be an algorithm and assume that for any input set $S \subseteq \mathcal{U}$ with $\left|S\right| \le n$, $A(S)$ is a $(\mathcal{Q},\varepsilon)$-filter for $S$. Then the expected size of $A(S)$ must satisfy $\mathbb{E}\left[\left|A(S)\right|\right] \ge LB(\mathcal{P}_n,\mathcal{Q},\varepsilon) - O(n)$, where $S$ is sampled with respect to $\mathcal{P}_n$ and the queries are sampled with respect to $\mathcal{Q}$.

Figures (1)

  • Figure 1: A schematic visualization of the different regimes for $k_x$.

Theorems & Definitions (14)

  • Definition 1
  • Theorem 2: Lower bound - simplified
  • Theorem 3: Space-efficient filter - simplified
  • Theorem 4: Daisy Bloom filter - simplified
  • Theorem 5: Kraft's inequality (Cover and Thomas, 2006)
  • Theorem 6
  • Theorem 7
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • ...and 4 more