Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Yiyun He, Thomas Strohmer, Roman Vershynin, Yizhe Zhu

TL;DR

The paper addresses private data sharing for high-dimensional datasets by constructing a differentially private, low-dimensional synthetic data generator with a Wasserstein-1 utility guarantee. It introduces a private PCA procedure that does not rely on a spectral-gap assumption and uses a centered covariance framework to obtain a private d'-dimensional subspace, followed by subspace projection and DP synthetic-data routines (PMM for d'=2 and PSMM for d'≥3). The resulting algorithm achieves a three-term error bound that scales with the tail of the covariance spectrum, the private-PCA perturbation, and the subspace data-generation term, with improved rates when the data lie in an affine subspace. The work also provides adaptive private selection of d', analyzes privacy guarantees through composition, and discusses comparisons to prior DP synthetic-data approaches, highlighting practical efficiency and the potential for broader applicability beyond the linear subspace setting.
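
The first two stages described above (a noisy PCA step followed by projection onto the resulting d'-dimensional subspace) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name, the `noise_scale` parameter, the 1/n covariance normalization, and the choice of symmetric Laplace noise are not taken from the paper. The noise is not calibrated to any privacy budget and the mean is not privatized, so the code as written is not differentially private. It only shows the shape of the computation.

```python
import numpy as np

def noisy_pca_projection(X, d_prime, noise_scale, rng):
    """Project the columns of X onto a noisy top-d' principal subspace.

    X           : (d, n) array, one data point per column.
    d_prime     : target subspace dimension.
    noise_scale : Laplace scale (hypothetical parameter, NOT calibrated
                  to any epsilon in this sketch).
    rng         : numpy random Generator.
    """
    d, n = X.shape
    mean = X.mean(axis=1, keepdims=True)
    centered = X - mean
    cov = centered @ centered.T / n  # (d, d) centered covariance

    # Symmetric noise: sample the upper triangle, then mirror it.
    upper = np.triu(rng.laplace(scale=noise_scale, size=(d, d)))
    noise = upper + np.triu(upper, k=1).T
    noisy_cov = cov + noise

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # and keep the eigenvectors of the d' largest eigenvalues.
    _, eigvecs = np.linalg.eigh(noisy_cov)
    V = eigvecs[:, ::-1][:, :d_prime]  # (d, d') orthonormal basis
    coords = V.T @ centered            # (d', n) low-dimensional coordinates
    return V, coords, mean

rng = np.random.default_rng(0)
X = rng.random((5, 50))  # 50 points in [0, 1]^5
V, coords, mean = noisy_pca_projection(X, d_prime=2, noise_scale=0.01, rng=rng)
```

Because no eigenvalue gap is used anywhere in the sketch, it runs even when the top eigenvalues are nearly equal. A synthetic-data routine would then operate on `coords` rather than on the original d-dimensional points.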

Abstract

Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.
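
To make the utility measure concrete, here is a small sketch of the Wasserstein-1 distance between two empirical distributions in the simplest setting: one dimension and equal sample sizes, where the optimal coupling matches sorted samples. The function name is hypothetical, and this special case is an illustration only. The paper works with datasets in $[0,1]^d$, where no such closed form applies.

```python
import numpy as np

def wasserstein1_1d(x, y):
    """W1 distance between two 1-D empirical measures of equal size.

    In one dimension the optimal transport plan pairs the i-th smallest
    point of x with the i-th smallest point of y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch requires equal sample sizes")
    return float(np.mean(np.abs(x - y)))

true_data = [0.0, 0.5, 1.0]
synthetic = [0.1, 0.6, 1.1]
dist = wasserstein1_1d(true_data, synthetic)  # each point moves by 0.1
```

A small value of `dist` means the synthetic sample can be transported onto the true sample at low average cost, which is the sense in which a Wasserstein bound controls accuracy.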

Paper Structure

This paper contains 29 sections, 15 theorems, 70 equations, 5 algorithms.

Key Result

Theorem 1.2

Let $\Omega=[0,1]^d$ be equipped with the $\ell^{\infty}$ metric and let $\mathbf{X}=[X_1,\dots, X_n]\in \Omega^n$ be a dataset. For any $2\leq d'\leq d$, Algorithm alg: affine outputs an $\varepsilon$-differentially private synthetic dataset $\mathbf{Y}=[Y_1,\dots, Y_m]\in \Omega^{m}$ for some $m\geq 1$ in po… where $\sigma_i(\mathbf{M})$ is the $i$-th largest eigenvalue of $\mathbf{M}$ in eq: sample_c…

Theorems & Definitions (33)

  • Definition 1.1: Differential Privacy [dwork2014algorithmic]
  • Theorem 1.2
  • Definition 2.1: Differential privacy
  • Lemma 2.2: Theorem 3.16 in [dwork2014algorithmic]
  • Lemma 2.3: Theorem 1 in [dwork2006our]
  • Definition 2.4: Integer Laplacian distribution [inusah2006discrete]
  • Definition 2.5: $p$-Wasserstein distance
  • Proposition 3.1
  • Proof
  • Lemma 3.2: Stability of noisy projection
  • ...and 23 more
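
For readers without the full text at hand, the two central definitions named in the list above can be written in their standard textbook forms. The notation here ($\mathcal{M}$, $\rho$, $\Gamma(\mu,\nu)$) is an assumption and may differ from the paper's own statements.

```latex
% epsilon-differential privacy: for every pair of datasets X, X' that
% differ in a single element, and every measurable set S of outputs,
\mathbb{P}\{\mathcal{M}(\mathbf{X}) \in S\}
  \le e^{\varepsilon}\, \mathbb{P}\{\mathcal{M}(\mathbf{X}') \in S\}.

% p-Wasserstein distance between probability measures mu and nu on a
% metric space (Omega, rho), where Gamma(mu, nu) is the set of couplings
% of mu and nu:
W_p(\mu, \nu)
  = \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
      \int_{\Omega \times \Omega} \rho(x, y)^p \, d\gamma(x, y)
    \right)^{1/p}.
```

The case $p=1$ with $\rho$ the $\ell^{\infty}$ metric on $[0,1]^d$ corresponds to the Wasserstein-1 utility guarantee mentioned in the TL;DR.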