DPI: Ensuring Strict Differential Privacy for Infinite Data Streaming

Shuya Feng, Meisam Mohammady, Han Wang, Xiaochen Li, Zhan Qin, Yuan Hong

TL;DR

DPI addresses the challenge of protecting user-level privacy in infinite data streams by bounding cumulative privacy loss while maintaining data utility. It combines sensitivity compression via a probability density function (PDF) representation, a 0-DP synopsis pool, and a bi-directional DPI boosting mechanism with a Random Budget Allocation scheme to ensure a converging privacy budget. The framework provides formal privacy guarantees with a logarithmic growth of the total privacy budget and derives an optimal boosting strategy to maximize utility under DP constraints. Extensive experiments on real and synthetic streams demonstrate DPI’s strong privacy protection and high utility across applications such as statistical queries, anomaly detection, and recommender systems. This work enables practical, privacy-preserving real-time analytics for unbounded data streams.

Abstract

Streaming data, crucial for applications like crowdsourcing analytics, behavior studies, and real-time monitoring, faces significant privacy risks due to the large and diverse data linked to individuals. In particular, recent efforts to release data streams under the rigorous privacy notion of differential privacy (DP) have encountered unbounded privacy leakage. This limits their applicability to a finite number of time slots ("finite data stream") or forces a relaxation that protects individual events ("event or $w$-event DP") rather than all the records of a user. A persistent challenge is managing the sensitivity of outputs to inputs when users contribute many activities and data distributions evolve over time. In this paper, we present a novel technique for Differentially Private data streaming over Infinite disclosure (DPI) that effectively bounds the total privacy leakage of each user in infinite data streams while enabling accurate data collection and analysis. We further maximize the accuracy of DPI via a novel boosting mechanism. Finally, extensive experiments across various streaming applications and real datasets (e.g., COVID-19, Network Traffic, and USDA Production) show that DPI maintains high utility for infinite data streams in diverse settings. Code for DPI is available at https://github.com/ShuyaFeng/DPI.

Paper Structure

This paper contains 43 sections, 6 theorems, 25 equations, 15 figures, 3 tables, 4 algorithms.

Key Result

Theorem 5.1

Consider a series of DPI mechanisms, each triggered with $\eta^{Q}_i$ and $\eta^{\mathcal{A}}_i$. The overall privacy budget of DPI after $t$ iterations is given by:

Figures (15)

  • Figure 1: DPI guarantees a bounded privacy loss and bounded sensitivity by managing the pace of convergence and the error margins, while striving for the highest level of accuracy (the output streaming PDFs can support most real-time applications in place of the input stream).
  • Figure 2: The DPI framework in time slot $t$. (1) $Q^t_s$ can be a single query or a subset of queries, and can remain the same or dynamically change over time, (2) all the privacy budgets are pre-computed with converging series and $\epsilon_t$ is sampled in real-time, and halved for query-sampling and synopsis-sampling, (3) generator returns the PDF output (sampled synopsis) for the current time slot $t$ to the interface/analyst, and also sends it for boosting, and (4) DPI boosting (bi-directional) reweights the PDFs (both query-sampling probability and synopsis-sampling probability) for the next time slot $(t+1)$.
  • Figure 3: Average MSE and KL divergence regarding $\epsilon$ for $2,700$ time slots. Figures (a), (c), (e), (g), (i), and (k) are MSE and KL divergence with different $\lambda$, $\mu$, and $\zeta$ values for accumulative data distribution disclosure. Figures (b), (d), (f), (h), (j), and (l) are the MSE and KL divergence with different $\lambda$, $\mu$, and $\zeta$ values for instantaneous data distribution disclosure.
  • Figure 4: Average MSE and KL divergence of statistical queries for $\epsilon$ and $t$ across $1,000$ time slots on COVID-19 dataset. (a), (b): Average MSE and KL divergence with varying $\epsilon$. (c), (d): Average MSE and KL divergence over $t$ when $\epsilon$ is $2$.
  • Figure 5: Average Precision and Recall for anomaly detection over $2,700$ time slots on Network Traffic dataset. (a), (b): anomaly detection with different $\epsilon$. (c), (d): anomaly detection in time slot $t$ ($\epsilon = 2$). Due to unsupervised learning on unlabeled data, non-private results are treated as the ground truth.
  • ...and 10 more figures
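
The Figure 2 caption mentions per-slot privacy budgets that are pre-computed from a converging series and then halved between query-sampling and synopsis-sampling. The Python sketch below illustrates only that idea. The geometric series, the `ratio` parameter, and the function names `budget_series` and `split_budget` are illustrative assumptions, not the series or API used by DPI, and the real-time random sampling of $\epsilon_t$ is not modeled.

```python
from itertools import islice


def budget_series(total_epsilon, ratio=0.5):
    """Yield per-slot budgets eps_t = total_epsilon * (1 - ratio) * ratio**(t - 1).

    Assumption: a geometric series is used purely for illustration. Its
    infinite sum equals total_epsilon, so the cumulative budget stays
    bounded no matter how many time slots are released.
    """
    if total_epsilon <= 0.0:
        raise ValueError("total_epsilon must be positive")
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    eps_t = total_epsilon * (1.0 - ratio)
    while True:
        yield eps_t
        eps_t *= ratio


def split_budget(eps_t):
    """Halve a slot budget: one half for query-sampling, one for synopsis-sampling."""
    half = eps_t / 2.0
    return half, half


# Usage: inspect the first few slots and the running total.
spent = 0.0
for t, eps_t in enumerate(islice(budget_series(2.0, ratio=0.5), 5), start=1):
    eps_query, eps_synopsis = split_budget(eps_t)
    spent += eps_t
    print(t, eps_t, eps_query, eps_synopsis, spent)
```

A geometric series shrinks the per-slot budget quickly, so later slots would receive much more noise; a slower-converging series trades a larger early spend for better late-slot accuracy. Which trade-off DPI makes is not stated in this summary.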

Theorems & Definitions (14)

  • Definition 1: Sampled Queries $Q_s$
  • Definition 2: Synopses
  • Definition 3: Base Synopsis Generator (Dwork et al., 2010)
  • Theorem 5.1
  • Proof
  • Theorem 5.2 (proof in the appendix)
  • Definition 4
  • Remark 5.1 (proof in the appendix)
  • Theorem 5.3 (proof in the appendix)
  • Theorem 5.4 (proof in the appendix)
  • ...and 4 more