Private Synthetic Data Generation in Bounded Memory
Rayne Holland, Seyit Camtepe, Chandra Thapa, Minhui Xue
TL;DR
PrivHP introduces a differentially private synthetic data generator designed for streaming under bounded memory. It leverages a private, pruned hierarchical partition of the input domain and private sketches to focus on high-frequency regions, enabling a controllable space-utility trade-off via the pruning parameter k. The authors prove a 1-Wasserstein utility bound that decomposes into privacy-noise and tail-frequency components, applicable to general metric spaces and specialized for [0,1]^d, with memory M = O(k log^2 n). This framework achieves ε-DP with tunable memory consumption, bridging optimal static private-data methods and memory-bounded streaming regimes for practical privacy-preserving data synthesis.
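To make the idea of a private, pruned hierarchical partition concrete, the following is a minimal sketch on [0,1]. It is not the authors' algorithm: it holds the whole dataset in memory and uses exact counts with Laplace noise instead of private sketches over a stream, and it does not enforce consistency between parent and child counts. The function names, the even split of the privacy budget across levels, and the choice to keep pruned nodes as leaves are all assumptions made for illustration.

```python
import numpy as np


def build_pruned_partition(data, epsilon, depth, k, rng):
    """Toy dyadic partition of [0, 1] with noisy counts and top-k pruning.

    Assumption: the budget is split evenly, epsilon / depth per level. Each
    point falls in exactly one queried interval per level, so the per-level
    sensitivity is 1 and the Laplace scale is depth / epsilon.
    """
    data = np.asarray(data, dtype=float)
    scale = depth / epsilon
    frontier = [(0.0, 1.0)]  # intervals that will be split at the next level
    leaves = []              # (noisy_count, lo, hi) for intervals never split
    for level in range(depth):
        children = []
        for lo, hi in frontier:
            mid = (lo + hi) / 2.0
            for a, b in ((lo, mid), (mid, hi)):
                # Half-open intervals, except that 1.0 belongs to the last one.
                upper = (data <= b) if b == 1.0 else (data < b)
                count = np.count_nonzero((data >= a) & upper)
                children.append((count + rng.laplace(0.0, scale), a, b))
        # Pruning looks only at noisy counts, so it is post-processing.
        children.sort(key=lambda c: c[0], reverse=True)
        if level == depth - 1:
            leaves.extend(children)
        else:
            leaves.extend(children[k:])  # low-frequency nodes stop here
            frontier = [(a, b) for _, a, b in children[:k]]
    return leaves


def sample_synthetic(leaves, m, rng):
    """Draw m points: pick a leaf by clipped noisy count, then a uniform point in it."""
    weights = np.array([max(c, 0.0) for c, _, _ in leaves])
    if weights.sum() <= 0.0:
        weights = np.ones(len(leaves))
    idx = rng.choice(len(leaves), size=m, p=weights / weights.sum())
    lo = np.array([leaves[i][1] for i in idx])
    hi = np.array([leaves[i][2] for i in idx])
    return rng.uniform(lo, hi)


rng = np.random.default_rng(0)
data = rng.beta(2.0, 8.0, size=2000)  # skewed toy data on [0, 1]
leaves = build_pruned_partition(data, epsilon=1.0, depth=6, k=4, rng=rng)
synthetic = sample_synthetic(leaves, 500, rng)
```

In this sketch at most k intervals are expanded per level, so the number of stored nodes grows with k times the depth rather than with the number of distinct subdomains, which is the kind of space-utility knob the pruning parameter provides.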
Abstract
We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$ of size $n = |\mathcal{X}|$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 n)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator with expected Wasserstein distance \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{\|\mathtt{tail}_k(\mathcal{X})\|_1}{M n}\right) \] from the empirical distribution. This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
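The two terms of the bound can be explored numerically. The sketch below computes the $\ell_1$ norm of $\mathtt{tail}_k$ for a single list of subdomain counts and evaluates the privacy-noise and tail terms with every hidden constant set to 1. The function names, the use of the natural logarithm, and the treatment of $M$ as exactly $k\log^2 n$ are assumptions for illustration, so the outputs only indicate how the terms scale and are not actual error values.

```python
import math

import numpy as np


def tail_k_l1(frequencies, k):
    """Sum of all subdomain frequencies except the k largest."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    return float(freqs[k:].sum())


def bound_terms(n, k, epsilon, tail_l1):
    """Return (M, noise_term, tail_term) with all O(.) constants set to 1.

    Assumption: M = k * log(n)^2 with the natural logarithm.
    """
    M = k * math.log(n) ** 2
    noise_term = math.log(M) ** 2 / (epsilon * n)
    tail_term = tail_l1 / (M * n)
    return M, noise_term, tail_term


# Hypothetical subdomain counts for a skewed dataset of n = 100 points.
counts = [50, 30, 10, 5, 3, 2]
n = sum(counts)
for k in (1, 2, 4):
    t = tail_k_l1(counts, k)
    M, noise_term, tail_term = bound_terms(n, k, epsilon=1.0, tail_l1=t)
    print(f"k={k} tail={t} M={M:.1f} noise={noise_term:.4f} tail_term={tail_term:.4f}")
```

Raising $k$ shrinks the tail term, since fewer frequencies fall outside the top $k$ and $M$ grows, while the noise term grows only through $\log^2 M$. For heavily skewed data the tail term is already small at modest $k$.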
