DPBloomfilter: Securing Bloom Filters with Differential Privacy

Yekun Ke, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR

This work targets privacy leakage in Bloom filters used for membership queries by integrating a differential privacy mechanism. It introduces DPBloomfilter, which applies random response to every Bloom filter bit, achieving $(\epsilon,\delta)$-DP while preserving the standard Bloom filter’s running time. The authors provide per-bit privacy proofs, a privacy budget calibrated through the random variable $W$ (the number of bits that change when one inserted element is substituted), and utility analyses that bound the impact on query accuracy. They accompany the theory with simulations that show high utility under practical DP budgets and illustrate how the error rates respond as the DP parameters vary. According to the authors, DPBloomfilter is the first approach to provide differential privacy guarantees for Bloom filter membership queries without sacrificing efficiency, which is relevant to privacy-preserving large-scale data processing.

Abstract

The Bloom filter is a simple yet space-efficient probabilistic data structure that supports membership queries for dramatically large datasets. It is widely utilized and implemented across various industrial scenarios, often handling massive datasets that include sensitive user information necessitating privacy preservation. To address the challenge of maintaining privacy within the Bloom filter, we have developed the DPBloomfilter. This innovation integrates the classical differential privacy mechanism, specifically the Random Response technique, into the Bloom filter, offering robust privacy guarantees under the same running complexity as the standard Bloom filter. Through rigorous simulation experiments, we have demonstrated that our DPBloomfilter algorithm maintains high utility while ensuring privacy protections. To the best of our knowledge, this is the first work to provide differential privacy guarantees for the Bloom filter for membership query problems.
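The idea in the abstract can be illustrated with a minimal Python sketch: build an ordinary Bloom filter, then perturb every bit once with random response before answering queries. This is an illustration under stated assumptions, not the authors' algorithm as published. The class name `DPBloomFilterSketch`, the salted SHA-256 hashing scheme, and the keep/flip probabilities (keep a bit with probability $e^{\epsilon_0}/(1+e^{\epsilon_0})$, flip it otherwise, which is the textbook randomized-response rule) are all assumptions introduced here.

```python
import hashlib
import math
import random


class DPBloomFilterSketch:
    """Bloom filter whose bit array is perturbed once by random response.

    Hypothetical illustration; names and hashing scheme are not from the paper.
    """

    def __init__(self, m, k, eps0, seed=None):
        self.m = m              # length of the bit array
        self.k = k              # number of hash functions
        self.eps0 = eps0        # per-bit privacy parameter
        self.bits = bytearray(m)
        self.rng = random.Random(seed)
        self.released = False

    def _positions(self, item):
        # Assumed hashing scheme: k salted SHA-256 digests reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        if self.released:
            raise RuntimeError("cannot insert after the bits were perturbed")
        for p in self._positions(item):
            self.bits[p] = 1

    def release(self):
        # Random response on every bit: flip with probability 1 / (1 + e^eps0),
        # keep the true value otherwise (assumed textbook form of the mechanism).
        flip_prob = 1.0 / (1.0 + math.exp(self.eps0))
        for p in range(self.m):
            if self.rng.random() < flip_prob:
                self.bits[p] ^= 1
        self.released = True

    def query(self, item):
        # Same membership test as a standard Bloom filter, on the noisy bits.
        return all(self.bits[p] for p in self._positions(item))
```

In this sketch the perturbation is a single pass over the $m$ bits at release time, and each query still inspects only $k$ bits, as in a standard Bloom filter. Because bits can flip from 1 to 0, false negatives become possible, which a standard Bloom filter never produces.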

Paper Structure

This paper contains 36 sections, 15 theorems, 59 equations, 4 figures, 1 algorithm.

Key Result

Lemma 3.5

Let $M_1$ be an $(\epsilon_1,\delta_1)$-DP algorithm and $M_2$ be an $(\epsilon_2,\delta_2)$-DP algorithm. Then $M(X) = (M_1(X), M_2(M_1(X),X))$ is an $(\epsilon_1+\epsilon_2,\delta_1+\delta_2)$-DP algorithm.
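As a hedged illustration of how this lemma can be used, consider applying it repeatedly to the bits that differ between two neighboring datasets. The derivation below is a sketch that assumes each differing bit is released by an $(\epsilon_0, 0)$-DP mechanism with $\epsilon_0 = \epsilon / N$, that bits which do not differ contribute no privacy loss, and that $W$ (the number of differing bits) exceeds $N$ with probability at most $\delta$. The formal argument in the paper may be organized differently.

```latex
\begin{align*}
  &\text{Each differing bit: } (\epsilon_0, 0)\text{-DP}, \qquad \epsilon_0 = \frac{\epsilon}{N}. \\
  &\text{Applying basic composition } W - 1 \text{ times: } (W \epsilon_0,\, 0)\text{-DP}. \\
  &\text{If } W \le N: \quad W \epsilon_0 \le N \cdot \frac{\epsilon}{N} = \epsilon. \\
  &\text{Since } \Pr[W > N] \le \delta, \text{ the failure event is absorbed into } \delta,
   \text{ giving } (\epsilon, \delta)\text{-DP overall.}
\end{align*}
```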

Figures (4)

  • Figure 1: Let $W := |S|$ denote the number of bits in the Bloom filter changed by substituting an element in the inserted set $A$ (Definition 3.2, Neighboring Dataset). We achieve $\epsilon_0$-DP for each single bit and $(\epsilon, \delta)$-DP for the entire Bloom filter via random response (Definition 3.4), where $\epsilon_0 = \epsilon / N$. Here $N$ is the $(1 - \delta)$-quantile of the random variable $W$. We visualize the distribution of $W$ (see the lemma on the distribution of $W$) under the setting described in the experiments section: bit-array length $m = 2^{19}$, number of inserted elements $|A| = 10^{5}$, and number of hash functions $k=3$. The visualization indicates that $W$ is well concentrated, with most of its mass close to its mean.
  • Figure 2: Three kinds of error rates with different bit-array lengths $m$. We fix the number of inserted elements $|A|=10^5$, the number of hash functions $k = 3$, and $\delta = 0.01$ in $(\epsilon, \delta)$-DP. In the figure, $\log$ denotes $\log_2$. Left: total error denotes the case when we randomly choose queries from the universe $[n]$; Middle: false negative denotes the case when we randomly choose queries from the set $S$, which represents the set of elements inserted into the DP Bloom filter; Right: false positive denotes the case when we randomly choose queries from the set $\overline{S} = [n] \backslash S$. As $m$ increases, the total error rate and the false positive error rate decrease, while the false negative error rate remains constant. As $\epsilon$ approaches $0$, the DP Bloom filter gets closer to random guessing. In this case, the false positive error rate converges to $\frac{1}{2^k}$, and the false negative error rate converges to $1 - \frac{1}{2^k}$. This is consistent with the paper's lemma on random guessing. DPBloomfilter achieves practical utility when $\epsilon$ is small (e.g., $\epsilon < 10$).
  • Figure 3: Three kinds of error rates with different numbers of inserted elements $|A|$. We fix the length of the bit array $m=2^{19}$, the number of hash functions $k = 3$, and $\delta = 0.01$ in $(\epsilon, \delta)$-DP. As $|A|$ increases, the total error rate and the false positive error rate increase, while the false negative error rate remains constant.
  • Figure 4: Three kinds of error rates with different numbers of hash functions $k$. We fix the length of the bit array $m=2^{19}$, the number of inserted elements $|A| = 10^5$, and $\delta = 0.01$ in $(\epsilon, \delta)$-DP. As $k$ increases, the total error rate and the false positive error rate decrease, while the false negative error rate increases.
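The budgeting described in the Figure 1 caption ($N$ as the $(1-\delta)$-quantile of $W$, then $\epsilon_0 = \epsilon / N$) can be explored with a small Monte Carlo sketch. The code below is illustrative only: it assumes idealized, uniformly random hash positions, and the function names `simulate_w`, `upper_quantile`, and `per_bit_budget` are hypothetical helpers introduced here rather than anything defined in the paper.

```python
import math
import random


def simulate_w(m, num_inserted, k, trials, seed=0):
    """Sample W: bits that change when one inserted element is substituted.

    Assumes each hash value is uniform on [0, m); a modeling assumption.
    """
    rng = random.Random(seed)
    # Bits set by the other (num_inserted - 1) elements, which stay fixed.
    base = bytearray(m)
    for _ in range((num_inserted - 1) * k):
        base[rng.randrange(m)] = 1
    samples = []
    for _ in range(trials):
        old = {rng.randrange(m) for _ in range(k)}  # positions of the removed element
        new = {rng.randrange(m) for _ in range(k)}  # positions of its replacement
        # A bit differs iff exactly one of the two elements hits it
        # and no other inserted element already sets it.
        samples.append(sum(1 for p in old ^ new if not base[p]))
    return samples


def upper_quantile(samples, q):
    """Smallest sample value v such that at least a q fraction of samples are <= v."""
    ordered = sorted(samples)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def per_bit_budget(eps, delta, samples):
    """Return (N, eps0) with N the empirical (1 - delta)-quantile of W and eps0 = eps / N."""
    n = max(1, upper_quantile(samples, 1.0 - delta))  # guard against N = 0
    return n, eps / n
```

For instance, `per_bit_budget(3.0, 0.01, simulate_w(2**19, 10**5, 3, 20000))` estimates $N$ and $\epsilon_0$ for the setting in the Figure 1 caption with an illustrative $\epsilon = 3$. In this model $W$ can never exceed $2k$, since a substitution touches at most $k$ old and $k$ new positions.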

Theorems & Definitions (35)

  • Definition 3.1: Bloom Filter, b70
  • Definition 3.2: Neighboring Dataset, dmns06
  • Definition 3.3: Differential Privacy, dmns06
  • Definition 3.4: Random response mechanism
  • Lemma 3.5: Basic composition, gkk+23
  • Theorem 4.1: Privacy for Query, informal version of the formal query-privacy theorem
  • Theorem 4.2: Accuracy (compare DPBloom with true-answer) for Query, informal version of the formal accuracy theorem
  • Theorem 4.3: Running complexity of DPBloomfilter
  • proof
  • Definition 5.1: Definition of $W$
  • ...and 25 more