Private Synthetic Data Generation in Bounded Memory

Rayne Holland; Seyit Camtepe; Chandra Thapa; Minhui Xue

TL;DR

PrivHP introduces a differentially private synthetic data generator designed for streaming under bounded memory. It leverages a private, pruned hierarchical partition of the input domain and private sketches to focus on high-frequency regions, enabling a controllable space-utility trade-off via the pruning parameter k. The authors prove a 1-Wasserstein utility bound that decomposes into privacy-noise and tail-frequency components, applicable to general metric spaces and specialized for [0,1]^d, with memory M = O(k log^2 n). This framework achieves ε-DP with tunable memory consumption, bridging optimal static private-data methods and memory-bounded streaming regimes for practical privacy-preserving data synthesis.
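To make the idea of a pruned, noisy hierarchy concrete, the following is a minimal toy sketch in Python. It is not the authors' algorithm: it is non-streaming, uses exact counts rather than private sketches, assumes data in [0, 1), and assumes an even split of the privacy budget across levels. The names `build_pruned_partition`, `sample`, `depth`, and `seed` are hypothetical and chosen only for illustration; no privacy proof is claimed for this toy.

```python
import random

def laplace(scale, rng):
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def build_pruned_partition(data, depth, k, epsilon, seed=0):
    """Return {(lo, hi): noisy_count} for a pruned dyadic partition of [0, 1)."""
    rng = random.Random(seed)
    scale = depth / epsilon      # assumption: budget epsilon/depth per level
    frontier = [(0.0, 1.0)]      # cells that are still being refined
    leaves = {}                  # cells that are no longer refined
    for level in range(depth):
        noisy = {}
        for lo, hi in frontier:
            mid = (lo + hi) / 2
            for child in ((lo, mid), (mid, hi)):
                count = sum(1 for x in data if child[0] <= x < child[1])
                noisy[child] = count + laplace(scale, rng)
        ranked = sorted(noisy, key=noisy.get, reverse=True)
        last = level == depth - 1
        # Only the k cells with the largest noisy counts are refined further;
        # the rest are pruned and kept as coarse leaves.
        keep = [] if last else ranked[:k]
        for child in ranked[len(keep):]:
            leaves[child] = noisy[child]
        frontier = keep
    return leaves

def sample(leaves, size, seed=0):
    """Draw synthetic points: pick a leaf by noisy weight, then a uniform point in it."""
    rng = random.Random(seed)
    cells = list(leaves)
    weights = [max(leaves[c], 0.0) for c in cells]
    if sum(weights) <= 0:        # fall back to cell lengths if all weights vanish
        weights = [hi - lo for lo, hi in cells]
    picks = rng.choices(cells, weights=weights, k=size)
    return [rng.uniform(lo, hi) for lo, hi in picks]

data = [0.1] * 80 + [0.9] * 20
leaves = build_pruned_partition(data, depth=4, k=2, epsilon=1.0)
synthetic = sample(leaves, size=10)
```

In this toy, the number of stored cells grows roughly like k times the depth, which is the sense in which a larger k trades memory for a finer partition of the high-frequency regions.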

Abstract

We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$ of size $n = |\mathcal{X}|$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |\mathcal{X}|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator whose expected Wasserstein distance from the empirical distribution is \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{\|\mathtt{tail}_k(\mathcal{X})\|_1}{M n}\right). \] This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
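As a rough illustration of how the two terms of the bound behave, the sketch below computes the L1 norm of the tail and evaluates each term with the hidden constant dropped. It assumes natural logarithms and treats `frequencies` as a hypothetical list of subdomain counts whose sum is $n$; the function names are illustrative and do not come from the paper.

```python
import math

def tail_k_l1(frequencies, k):
    """L1 norm of all but the k largest frequencies."""
    ranked = sorted(frequencies, reverse=True)
    return sum(ranked[k:])

def bound_terms(frequencies, k, M, epsilon):
    """Return (privacy-noise term, tail term) of the bound, constants omitted."""
    n = sum(frequencies)                        # assumption: counts sum to n
    noise_term = math.log(M) ** 2 / (epsilon * n)
    tail_term = tail_k_l1(frequencies, k) / (M * n)
    return noise_term, tail_term

# A skewed input: the two heaviest subdomains hold 80 of 100 points.
counts = [50, 30, 10, 5, 5]
noise_term, tail_term = bound_terms(counts, k=2, M=16, epsilon=1.0)
```

For skewed inputs the tail norm is small, so the tail term shrinks and the privacy-noise term dominates; for near-uniform inputs the tail term matters more unless $k$ (and hence $M$) is increased.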
