Differentially Private Stream Processing at Scale

Bing Zhang, Vadym Doroshenko, Peter Kairouz, Thomas Steinke, Abhradeep Thakurta, Ziyin Ma, Eidan Cohen, Himani Apte, Jodi Spacek

TL;DR

This work introduces DP-SQLP, the first at-scale differentially private streaming aggregation system that supports unknown-domain histograms and user-level DP. It combines novel algorithms for private key selection, preemptive execution to avoid scanning all keys, and hierarchical DP-tree perturbation to release continual DP histograms in real time. Empirical results on synthetic and Reddit data show an error reduction of at least $16\times$ over the baselines considered and scalability to one billion keys, with two industrial deployments (Google Shopping and Google Trends) demonstrating practical impact. The system integrates a Spark-like streaming framework with Spanner and F1, and points to future improvements via DP-MF-based streaming and pan-privacy extensions.

Abstract

We design, to the best of our knowledge, the first differentially private (DP) stream aggregation processing system at scale. Our system -- Differential Privacy SQL Pipelines (DP-SQLP) -- is built using a streaming framework similar to Spark streaming, on top of the Spanner database and the F1 query engine from Google. Towards designing DP-SQLP we make both algorithmic and systemic advances, namely, we (i) design a novel (user-level) DP key selection algorithm that can operate on an unbounded set of possible keys, and can scale to one billion keys that users have contributed, (ii) design a preemptive execution scheme for DP key selection that avoids enumerating all the keys at each triggering time, and (iii) use algorithmic techniques from DP continual observation to release a continual DP histogram of user contributions to different keys over the stream length. We empirically demonstrate its efficacy by obtaining at least $16\times$ reduction in error over meaningful baselines we consider. We implemented streaming differentially private user impressions for Google Shopping with DP-SQLP. The streaming DP algorithms are further applied to Google Trends.
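Item (iii) refers to DP continual observation. As a rough illustration of that family of techniques, the sketch below implements the classic binary tree mechanism for a single running count. It is not DP-SQLP's implementation: the function names, the choice of Laplace noise, the event-level (rather than user-level) guarantee, and the assumption that every stream value lies in $[0, 1]$ are all simplifications made for this example.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def tree_prefix_sums(stream, epsilon, rng):
    """Return noisy running sums of `stream` via the binary tree mechanism.

    Assumptions (illustrative only): each value is in [0, 1], the stream
    length is known up front, and privacy is event-level epsilon-DP.
    """
    T = len(stream)
    if T == 0:
        return []
    # Each element falls in exactly one dyadic node per level, so the
    # L1 sensitivity of the full vector of node sums equals `levels`.
    levels = T.bit_length()
    scale = levels / epsilon

    exact = [0.0]
    for x in stream:
        exact.append(exact[-1] + x)

    noisy_node = {}  # (level, start) -> noisy sum, noise drawn once per node
    released = []
    for t in range(1, T + 1):
        start, total = 0, 0.0
        # Decompose the prefix [0, t) into dyadic blocks, one per set bit of t.
        for level in reversed(range(levels)):
            width = 1 << level
            if t & width:
                key = (level, start)
                if key not in noisy_node:
                    block_sum = exact[start + width] - exact[start]
                    noisy_node[key] = block_sum + laplace(scale, rng)
                total += noisy_node[key]
                start += width
        released.append(total)
    return released


noisy = tree_prefix_sums([1, 0, 1, 1], epsilon=1.0, rng=random.Random(0))
print(noisy)
```

Each released count is assembled from at most `levels` noisy nodes, so its noise grows polylogarithmically with the stream length. A continual histogram would maintain one such counter per selected key.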

Paper Structure

This paper contains 31 sections, 4 theorems, 7 equations, 10 figures, 2 tables, 4 algorithms.

Key Result

Theorem 3.1

The key selection algorithm is $(\varepsilon, \delta + (e^\varepsilon+1)\cdot\beta)$-DP for addition or removal of one element of the dataset.
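To make the bound concrete, the snippet below evaluates the second privacy parameter, $\delta + (e^\varepsilon+1)\cdot\beta$, for one set of inputs. The function name and the numeric values are hypothetical and chosen only for illustration. The role of $\beta$ is defined in the paper, not in this summary.

```python
import math


def effective_delta(epsilon, delta, beta):
    # Second parameter of the guarantee in Theorem 3.1:
    # delta + (e^epsilon + 1) * beta.
    return delta + (math.exp(epsilon) + 1.0) * beta


# Hypothetical inputs: with epsilon = ln 3 the multiplier (e^epsilon + 1) is 4,
# so beta = 1e-10 adds roughly 4e-10 on top of delta = 1e-9.
print(effective_delta(math.log(3), 1e-9, 1e-10))
```

Because $\beta$ is multiplied by $e^\varepsilon + 1$, it has to be set well below the target $\delta$ when $\varepsilon$ is large.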

Figures (10)

  • Figure 1: Event time domain and processing time domain
  • Figure 2: High-level overview for the DP-SQLP system
  • Figure 3: Two types of streaming scheduling
  • Figure 4: Overview of streaming differential privacy mechanism
  • Figure 5: Execution of streaming differential privacy mechanism
  • ...and 5 more figures

Theorems & Definitions (12)

  • Theorem 3.1: Privacy guarantee
  • Theorem 3.2: Utility guarantee
  • Definition A.1: Input driven stream processing
  • Definition A.2: Event time and event time domain
  • Definition A.3: Processing time and processing time domain
  • Definition A.4: Windowing
  • Definition A.5: Triggering
  • Definition A.6: Micro-batch
  • Definition B.1
  • Definition B.2: Differential Privacy
  • ...and 2 more