Daisy Bloom Filters

Ioana O. Bercea; Jakob Bæk Tejs Houen; Rasmus Pagh

TL;DR

This work studies optimal Bloom-filter design under distributional data and query models, aiming to minimize space while maintaining efficient membership operations. It introduces a distribution-aware lower bound $LB(\mathcal{P}_n,\mathcal{Q},\varepsilon)$ and shows it tightly characterizes the space needed by any $(\mathcal{Q},\varepsilon)$-filter for inputs drawn from $\mathcal{P}_n$. Building on this, the authors present a space-efficient filter that matches the lower bound up to additive terms in $n$ and achieves worst-case constant-time operations, and they introduce the Daisy Bloom filter, which further reduces space to $\log(e)\cdot LB(\mathcal{P}_n,\mathcal{Q},\varepsilon) + O(n)$ bits with worst-case time $\lceil \log(1/\varepsilon)\rceil$ per operation. The approach partitions the universe by the ratio $q_x/p_x$, leverages multiple standard filters for frequent partitions, and employs concentration results (Bernstein) to guarantee high-probability performance over input sets drawn from $\mathcal{P}_n$. Overall, the paper advances the design of space-efficient, distribution-aware Bloom filters with provable optimality and practical worst-case performance guarantees under realistic distributional assumptions.
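To make the idea of distribution-dependent hashing concrete, here is a minimal Python sketch of a Bloom filter in which each element $x$ uses its own number of hash functions $k_x$. The rule for $k_x$ used here (roughly $\log_2\big(q_x/(\varepsilon p_x)\big)$, clamped to the range $[0, \lceil\log_2(1/\varepsilon)\rceil]$) is an illustrative assumption motivated by the partition on $q_x/p_x$; it is not the paper's exact definition of the regimes. The names `hash_count` and `VariableKBloom` are hypothetical.

```python
import hashlib
import math


def hash_count(p_x: float, q_x: float, eps: float) -> int:
    """Assumed, simplified rule for k_x; NOT the paper's exact definition."""
    k_max = math.ceil(math.log2(1.0 / eps))
    if q_x <= eps * p_x:
        # Queried rarely relative to how likely it is to be in the set:
        # spend no bits on it and always answer "present".
        return 0
    k = math.ceil(math.log2(q_x / (eps * p_x)))
    return max(0, min(k, k_max))


class VariableKBloom:
    """Single bit array; each element sets and checks its own number of bits."""

    def __init__(self, m_bits: int):
        self.m = m_bits
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str, k: int):
        # Derive k bit positions from salted SHA-256 digests (illustrative choice).
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str, k: int) -> None:
        for pos in self._positions(item, k):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def query(self, item: str, k: int) -> bool:
        # With k == 0 there is nothing to check, so the answer is always True.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item, k)
        )
```

In this sketch, an element with `k == 0` is treated as always present, which costs no space but counts as a false positive whenever that element is queried while absent from the set.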

Abstract

A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S'\supseteq S$ that is "close to $S$" in the sense that for $x\not\in S$, the probability that $x\in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
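As a baseline for the comparison with the standard Bloom filter, the textbook sizing formulas can be evaluated directly. The sketch below assumes the classical analysis of a standard Bloom filter (about $n\log_2(e)\log_2(1/\varepsilon)$ bits and $\lceil\log_2(1/\varepsilon)\rceil$ hash functions for $n$ elements and false positive rate $\varepsilon$). The function name `standard_bloom_parameters` is hypothetical, and nothing here is taken from the paper's own constructions.

```python
import math


def standard_bloom_parameters(n: int, eps: float) -> tuple[int, int]:
    """Return (number of bits, number of hash functions) for a classical Bloom filter."""
    log_inv_eps = math.log2(1.0 / eps)
    # Classical optimum: m = n * log2(e) * log2(1/eps) bits.
    m_bits = math.ceil(n * math.log2(math.e) * log_inv_eps)
    # Classical optimum: k = log2(1/eps) hash functions, rounded up.
    k = math.ceil(log_inv_eps)
    return m_bits, k


n = 1_000_000
m_bits, k = standard_bloom_parameters(n, eps=0.01)
print(f"bits per element: {m_bits / n:.2f}, hash functions: {k}")
```

For $\varepsilon = 0.01$ this gives roughly 9.6 bits per element and 7 hash functions, independent of how the data and queries are distributed; the distribution-aware bounds discussed here replace the worst-case $n\log_2(1/\varepsilon)$ term with $LB(\mathcal{P}_n,\mathcal{Q},\varepsilon)$.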

Paper Structure (12 sections, 13 theorems, 13 equations, 1 figure)

This paper contains 12 sections, 13 theorems, 13 equations, 1 figure.

Introduction
Our Contributions
Related Work
Paper Organization
Preliminaries
The Lower Bound
Space-Efficient Filter
Construction
Analysis
Remarks
The Daisy Bloom Filter Analysis
Choices for $\delta$ and $\gamma$ in the proof of Lemma \ref{boundfprate}

Key Result

Theorem 2

Let $A$ be an algorithm and assume that for any input set $S \subseteq \mathcal{U}$ with $\left|S\right| \le n$, $A(S)$ is a $(\mathcal{Q},\varepsilon)$-filter for $S$. Then the expected size of $A(S)$ is bounded from below by $LB(\mathcal{P}_n,\mathcal{Q},\varepsilon)$ up to additive terms in $n$, where $S$ is sampled with respect to $\mathcal{P}_n$ and the queries are sampled with respect to $\mathcal{Q}$.

Figures (1)

Figure 1: A schematic visualization of the different regimes for $k_x$.

Theorems & Definitions (14)

Definition 1
Theorem 2: Lower bound - simplified
Theorem 3: Space-efficient filter - simplified
Theorem 4: Daisy Bloom filter - simplified
Theorem 5: Kraft's inequality (Cover and Thomas, 2006)
Theorem 6
Theorem 7
Lemma 8
Lemma 9
Lemma 10
...and 4 more
