Private Synthetic Data Generation in Bounded Memory

Rayne Holland, Seyit Camtepe, Chandra Thapa, Minhui Xue

TL;DR

PrivHP introduces a differentially private synthetic data generator designed for streaming under bounded memory. It leverages a private, pruned hierarchical partition of the input domain and private sketches to focus on high-frequency regions, enabling a controllable space-utility trade-off via the pruning parameter k. The authors prove a 1-Wasserstein utility bound that decomposes into privacy-noise and tail-frequency components, applicable to general metric spaces and specialized for [0,1]^d, with memory M = O(k log^2 n). This framework achieves ε-DP with tunable memory consumption, bridging optimal static private-data methods and memory-bounded streaming regimes for practical privacy-preserving data synthesis.

Abstract

We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |\mathcal{X}|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator with expected Wasserstein distance: \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{||\mathtt{tail}_k(\mathcal{X})||_1}{M n}\right) \] from the empirical distribution. This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
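
The skew measure $\mathtt{tail}_k$ described in the abstract can be illustrated with a minimal Python sketch. It assumes the subdomain frequencies are already available as a plain list of counts; the function names `tail_k` and `tail_l1` are hypothetical, and the paper's formal definition (which is stated over the hierarchical partition) may differ in detail.

```python
def tail_k(frequencies, k):
    """Return the frequencies that remain after dropping the k largest.

    Assumption: `frequencies` is a flat list of subdomain counts.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ordered = sorted(frequencies, reverse=True)
    return ordered[k:]


def tail_l1(frequencies, k):
    """L1 norm of the tail, i.e. the total mass outside the top k."""
    return sum(abs(f) for f in tail_k(frequencies, k))


counts = [50, 30, 10, 5, 3, 2]
print(tail_l1(counts, 2))  # 10 + 5 + 3 + 2 = 20
```

On a skewed input most of the mass sits in the top $k$ subdomains, so the tail norm is small and the second term of the bound above shrinks accordingly.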

Paper Structure

This paper contains 30 sections, 19 theorems, 74 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

When $\Omega = [0,1]^d$, for pruning parameter $k$, $\mathtt{PrivHP}$ can process a stream $\mathcal{X}$ of size $n$ in $M=\mathcal{O}(k\log^2(n))$ memory and $\mathcal{O}(\log (\varepsilon n))$ update time. $\mathtt{PrivHP}$ can subsequently output an $\varepsilon$-differentially private synthetic d

Figures (4)

  • Figure 1: The update procedure for a Count-Min Sketch. It follows that $h_j(x)=i$.
  • Figure 2: Illustration of Algorithm \ref{alg:growpartition} with $k=2, L_{\star} = 1$ and $L=4$. Figure \ref{fig:input_growpartition} represents its input.
  • Figure 3: Subtree used for Example \ref{example:miss}
  • Figure 4: Illustration of the proof pipeline for Theorem \ref{thm:utilityHP}. $k = 2$, $L_{\star} = 2$, $L=3$.
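
Figure 1 refers to the update procedure of a Count-Min Sketch, in which an item $x$ increments the counter at column $h_j(x)$ of every row $j$. The following is a minimal, non-private Python sketch of that standard data structure. The class name, the use of salted BLAKE2b as the row hash functions, and the parameter names `depth` and `width` are assumptions made for illustration; the paper's private sketches additionally involve noise, which is not shown here.

```python
import hashlib


class CountMinSketch:
    """Standard (non-private) Count-Min Sketch with `depth` rows of `width` counters."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # h_j(x): one hash function per row, obtained by salting with the row index.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Increment one counter in every row.
        for j in range(self.depth):
            self.table[j][self._bucket(j, item)] += count

    def query(self, item):
        # Collisions only ever add to a counter, so the minimum over rows
        # is the tightest available overestimate of the true frequency.
        return min(self.table[j][self._bucket(j, item)] for j in range(self.depth))


cms = CountMinSketch(depth=4, width=64)
for token in ["a", "b", "a", "c", "a"]:
    cms.update(token)
print(cms.query("a") >= 3)  # estimates never fall below the true count
```

The sketch uses `depth * width` counters regardless of stream length, which is what makes it suitable for estimating subdomain frequencies in bounded memory.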

Theorems & Definitions (30)

  • Theorem 1
  • Definition 1: Differential Privacy -- 1-Pass
  • Lemma 1: Laplace Mechanism
  • Lemma 2: Post-Processing
  • Lemma 3: Basic Composition
  • Lemma 4
  • Theorem 2
  • proof
  • Theorem 3
  • Lemma 5
  • ...and 20 more
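
Lemma 1 in the list above names the Laplace mechanism. As a reminder of the standard construction (not the paper's exact statement, which is not reproduced on this page), a query with $\ell_1$-sensitivity $\Delta$ is released with additive Laplace noise of scale $\Delta/\varepsilon$. The Python sketch below assumes a scalar query; the function name and arguments are hypothetical, and Python's `random` module is used only for illustration since it is not a cryptographically secure noise source.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


rng = random.Random(0)  # seeded only to make the illustration repeatable
noisy_count = laplace_mechanism(true_value=100, sensitivity=1, epsilon=1.0, rng=rng)
print(noisy_count)
```

Smaller values of `epsilon` give a larger noise scale, which corresponds to the privacy-noise term in the utility bound growing as $\varepsilon$ shrinks.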