High-Dimensional Asymptotics of Differentially Private PCA

Youngjoo Yun, Rishabh Dudeja

TL;DR

This work develops sharp, high-dimensional characterizations of both utility and privacy for the exponential mechanism in differentially private PCA. By analyzing the regime $p\to\infty$ under spectral regularity, it derives exact limits for the overlap between true and privatized PCs and reveals phase transitions and a privacy plateau as the noise parameter varies. The authors introduce a data-dependent noise calibration that achieves target AGDP guarantees, along with a sampling algorithm whose output approximates the Gibbs distribution in total variation. Their approach combines trade-off-function DP with Le Cam contiguity, through spherical-integral techniques and Gaussian approximations, yielding end-to-end privacy guarantees even when using dataset-dependent spectral estimates. Empirical results on real genomic data corroborate the asymptotic predictions and demonstrate the practical viability of data-adaptive privacy for privatized PCA.

Abstract

In differential privacy, statistics of a sensitive dataset are privatized by introducing random noise. Most privacy analyses provide privacy bounds specifying a noise level sufficient to achieve a target privacy guarantee. Sometimes, these bounds are pessimistic and suggest adding excessive noise, which overwhelms the meaningful signal. It remains unclear if such high noise levels are truly necessary or a limitation of the proof techniques. This paper explores whether we can obtain sharp privacy characterizations that identify the smallest noise level required to achieve a target privacy level for a given mechanism. We study this problem in the context of differentially private principal component analysis, where the goal is to privatize the leading principal components (PCs) of a dataset with n samples and p features. We analyze the exponential mechanism for this problem in a model-free setting and provide sharp utility and privacy characterizations in the high-dimensional limit ($p\rightarrow\infty$). Our privacy result shows that, in high dimensions, detecting the presence of a target individual in the dataset using the privatized PCs is exactly as hard as distinguishing two Gaussians with slightly different means, where the mean difference depends on certain spectral properties of the dataset. Our privacy analysis combines the hypothesis-testing formulation of privacy guarantees proposed by Dong, Roth, and Su (2022) with classical contiguity arguments due to Le Cam to obtain sharp high-dimensional privacy characterizations.
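
The Gaussian comparison in the abstract can be made concrete with the trade-off function of Dong, Roth, and Su: for testing $N(0,1)$ against $N(\mu,1)$, the smallest achievable type II error at type I error $\alpha$ is $G_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$. The following is a minimal sketch of that formula only. The function name `gaussian_tradeoff` and the argument `mu` are illustrative choices, and how the paper's spectral quantities (or its $\sigma$-AGDP parameter) map to `mu` is an assumption that this page does not confirm.

```python
from statistics import NormalDist

# Standard normal distribution, used for Phi and its inverse.
_STD_NORMAL = NormalDist()


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """Smallest type II error at type I error `alpha` when testing
    N(0, 1) against N(mu, 1), i.e. Phi(Phi^{-1}(1 - alpha) - mu).

    `mu` is a generic mean separation (hypothetical name); it is NOT
    taken from the paper's notation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if mu < 0.0:
        raise ValueError("mu must be non-negative")
    # inv_cdf is undefined at 0 and 1, so handle the endpoints directly.
    if alpha == 0.0:
        return 1.0
    if alpha == 1.0:
        return 0.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


if __name__ == "__main__":
    # Illustrative mean separations only; larger mu means easier detection.
    for mu in (0.5, 1.0, 1.5):
        row = [round(gaussian_tradeoff(a, mu), 4) for a in (0.01, 0.05, 0.5)]
        print(f"mu={mu}: {row}")
```

With `mu = 0` the curve reduces to $1-\alpha$ (the two hypotheses are indistinguishable), and it decreases as `mu` grows, which is the sense in which a larger mean difference corresponds to a weaker privacy guarantee.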

Paper Structure

This paper contains 129 sections, 30 theorems, 513 equations, 7 figures, 4 algorithms.

Key Result

Theorem 1

Consider a sequence of datasets $X \subset \mathbb{R}^p$ that satisfies the data assumption (assump:data), a sequence of noise parameters $\beta_{p} \rightarrow \beta \in [0,\infty)$ as $p \rightarrow \infty$, and a fixed rank $k \in \mathbb{N}$ (independent of $p$). Let ${U}_{\star} \in \mathbb{R}^{p \times k}$ denote

Figures (7)

  • Figure 1: Projections of the 1000 Genomes dataset onto the first $k=2$ PCs. Left to right: (1) non-private PCs, (2--4) privatized PCs with $\beta=0.01$, $\beta=0.2$, and $\beta = 1.2,$ respectively.
  • Figure 2: Visualization of trade-off-function lower bounds implied by various privacy guarantees.
  • Figure 3: Estimation error of the exponential mechanism in operator norm (left panel) and Frobenius norm (right panel) as a function of $\beta$ for $k = 1,2,3$ on the 1000 Genomes dataset. Solid curves represent theoretical predictions derived from \ref{thm:utility} and circular markers represent the empirical error, averaged over 30,000 Monte Carlo simulations. The dotted vertical lines represent the threshold values $\beta = H_\mu(\gamma_{1:3})$, at which phase transitions occur.
  • Figure 4: Leftmost panel: noise parameter $\beta$ vs. privacy parameter $\sigma_{\beta}$ for the 1000 Genomes dataset for $k = 1,2,3$. Three panels on right: projections of the 1000 Genomes dataset onto the first $k=2$ privatized PCs for $\beta=1.2$, $\beta=3.97$, and $\beta=8.47$. The privatized PCs with these $\beta$ values satisfy reasonable $\sigma$-AGDP guarantees with $\sigma = 0.5$, $\sigma=1$, and $\sigma=1.5$, respectively.
  • Figure 5: Trade-off functions for the exponential mechanism on the 1000 Genomes dataset for rank $k \in \{1, 2, 3\}$ and $\beta$ calibrated to achieve target privacy levels $\sigma \in \{0.5, 1, 1.5\}$. For $k=3$, only $\sigma=1.5$ is achievable. Solid curves: theoretical predictions from \ref{thm:privacy}. Circular markers: empirically estimated trade-off functions (using 30,000 Monte Carlo samples). Dashed curves: non-asymptotic privacy bounds from prior work (chaudhuri2013near), which are essentially vacuous for these values of $\beta$ and coincide with the axes.
  • ...and 2 more figures

Theorems & Definitions (75)

  • Definition 1: Exponential Mechanism for Differentially Private PCA (mcsherry2007mechanism, chaudhuri2013near)
  • Definition 2: Neighboring datasets
  • Definition 3: Trade-off function (dong2022gaussian)
  • Definition 4: Rényi Divergence and Rényi Differential Privacy (mironov2017renyi)
  • Theorem 1
  • Definition 5: Asymptotic Gaussian Differential Privacy
  • Theorem 2
  • Theorem 3
  • Proof sketch
  • Theorem 4
  • ...and 65 more