Differentially Private Synthetic High-dimensional Tabular Stream

Girish Kumar, Thomas Strohmer, Roman Vershynin

TL;DR

The paper proposes an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. The framework can be used for high-dimensional tabular data.

Abstract

While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. Our algorithm satisfies differential privacy for the entire input stream (continual differential privacy) and can be used for high-dimensional tabular data. Furthermore, we show the utility of our method via experiments on real-world datasets. The proposed algorithm builds upon a popular select, measure, fit, and iterate paradigm (used by offline synthetic data generation algorithms) and private counters for streams.

Paper Structure

This paper contains 33 sections, 6 theorems, 34 equations, 9 figures, 3 tables, 4 algorithms.

Key Result

theorem 1

The exponential mechanism, as defined in Definition 8, satisfies $\varepsilon$-differential privacy.
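To make the statement concrete, here is a minimal sketch of the exponential mechanism in its standard textbook form: a candidate $r$ is sampled with probability proportional to $\exp(\varepsilon \, u(r) / (2\Delta))$, where $u$ is a score function and $\Delta$ is its sensitivity. This summary does not reproduce the paper's definition, so the normalization by $2\Delta$ is an assumption, and the function name `exponential_mechanism` and its parameters are hypothetical, chosen for illustration only.

```python
import math
import random


def exponential_mechanism(candidates, score, epsilon, sensitivity, rng=random):
    """Sample one candidate with probability proportional to
    exp(epsilon * score(candidate) / (2 * sensitivity)).

    Assumption: this is the standard textbook form; the paper's own
    definition may be stated with a different normalization.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if not candidates:
        raise ValueError("candidates must be non-empty")

    scores = [score(c) for c in candidates]
    # Shift by the maximum score for numerical stability. The shift
    # multiplies every weight by the same constant, so the sampling
    # distribution is unchanged.
    max_score = max(scores)
    weights = [
        math.exp(epsilon * (s - max_score) / (2.0 * sensitivity))
        for s in scores
    ]
    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical scores for three candidate queries.
    toy_scores = {"q1": 3.0, "q2": 7.0, "q3": 5.0}
    chosen = exponential_mechanism(
        list(toy_scores), toy_scores.get, epsilon=1.0, sensitivity=1.0
    )
    print("selected:", chosen)
```

Smaller values of `epsilon` flatten the sampling distribution toward uniform, while larger values concentrate it on the highest-scoring candidates. In select, measure, fit, and iterate style algorithms, a mechanism of this kind is typically what performs the private "select" step.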

Figures (9)

  • Figure 1: Metrics over time to compare the baseline and proposed method for the Eviction-weekly dataset.
  • Figure 2: Metrics over time to compare the baseline and proposed method for the Eviction-bi-weekly dataset.
  • Figure 3: Metrics over time to compare the baseline and proposed method for the Adult-randomized-bs-50 dataset.
  • Figure 4: Metrics over time to compare the baseline and proposed method for the Adult-randomized-bs-200 dataset.
  • Figure 5: Metrics over time to compare the baseline and proposed method for the Adult-ordered-bs-50 dataset.
  • ...and 4 more figures

Theorems & Definitions (16)

  • definition 1: Streaming algorithm
  • definition 2: Differential stream
  • definition 3: Neighboring streams
  • definition 4: Differential privacy (for streams)
  • definition 5: k-way marginal query
  • definition 6: Accuracy of an algorithm generating synthetic dataset
  • definition 7: Accuracy of a streaming algorithm
  • definition 8: Exponential mechanism
  • theorem 1: Privacy of exponential mechanism
  • theorem 2: Accuracy of exponential mechanism
  • ...and 6 more