High-Dimensional Asymptotics of Differentially Private PCA
Youngjoo Yun, Rishabh Dudeja
TL;DR
This work develops sharp high-dimensional characterizations of both utility and privacy for the exponential mechanism in differentially private PCA. Analyzing the regime $p\to\infty$ under spectral regularity, it derives exact limits for the overlap between the true and privatized PCs and reveals phase transitions and a privacy plateau as the noise parameter varies. The authors introduce a data-dependent noise calibration that achieves target AGDP guarantees, along with a sampling algorithm whose output approximates the Gibbs distribution in total variation. Their analysis combines the trade-off-function formulation of DP with Le Cam contiguity, using spherical-integral techniques and Gaussian approximations, and yields end-to-end privacy guarantees even when the mechanism uses dataset-dependent spectral estimates. Empirical results on real genomic data corroborate the asymptotic predictions and demonstrate the practical viability of data-adaptive privacy for privatized PCA.
Abstract
In differential privacy, statistics of a sensitive dataset are privatized by introducing random noise. Most privacy analyses provide privacy bounds specifying a noise level sufficient to achieve a target privacy guarantee. Sometimes, these bounds are pessimistic and suggest adding excessive noise, which overwhelms the meaningful signal. It remains unclear if such high noise levels are truly necessary or a limitation of the proof techniques. This paper explores whether we can obtain sharp privacy characterizations that identify the smallest noise level required to achieve a target privacy level for a given mechanism. We study this problem in the context of differentially private principal component analysis, where the goal is to privatize the leading principal components (PCs) of a dataset with n samples and p features. We analyze the exponential mechanism for this problem in a model-free setting and provide sharp utility and privacy characterizations in the high-dimensional limit ($p\rightarrow\infty$). Our privacy result shows that, in high dimensions, detecting the presence of a target individual in the dataset using the privatized PCs is exactly as hard as distinguishing two Gaussians with slightly different means, where the mean difference depends on certain spectral properties of the dataset. Our privacy analysis combines the hypothesis-testing formulation of privacy guarantees proposed by Dong, Roth, and Su (2022) with classical contiguity arguments due to Le Cam to obtain sharp high-dimensional privacy characterizations.
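The Gaussian comparison in the abstract can be made concrete with the trade-off function from the hypothesis-testing framework of Dong, Roth, and Su (2022): for testing $N(0,1)$ against $N(\mu,1)$, the smallest achievable type II error at type I error level $\alpha$ is $\Phi(\Phi^{-1}(1-\alpha)-\mu)$. The following is a minimal Python sketch of that curve, not the authors' code. The function names are hypothetical, and the mean gap `mu` is treated as a given input, since its dependence on the dataset's spectral properties is not specified here.

```python
from math import exp
from statistics import NormalDist

# Standard normal distribution, used for Phi and its inverse.
_STD_NORMAL = NormalDist()


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """Smallest type II error of any level-alpha test of N(0,1) vs N(mu,1).

    `mu` is the (assumed, user-supplied) gap between the two Gaussian means.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if mu < 0.0:
        raise ValueError("mu must be non-negative")
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def gdp_to_delta(eps: float, mu: float) -> float:
    """delta(eps) implied by a Gaussian trade-off curve with mean gap mu.

    Uses the standard conversion from Gaussian DP to (eps, delta)-DP:
    delta = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2).
    """
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    phi = _STD_NORMAL.cdf
    return phi(-eps / mu + mu / 2.0) - exp(eps) * phi(-eps / mu - mu / 2.0)


if __name__ == "__main__":
    # Illustrative mean gaps only; these are not values from the paper.
    for mu in (0.0, 0.5, 1.0, 2.0):
        beta = gaussian_tradeoff(0.05, mu)
        print(f"mu={mu:.1f}: type II error at alpha=0.05 is {beta:.4f}")
```

A mean gap of zero returns $1-\alpha$, meaning the adversary can do no better than random guessing; larger gaps lower the curve, which corresponds to weaker privacy.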
