Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

T. Tony Cai, Dong Xia, Mengyue Zha

TL;DR

This work establishes optimal rates for differentially private PCA and covariance estimation within the spiked covariance model $\Sigma = U \Lambda U^{\top} + \sigma^2 I_p$, allowing diverging rank and high-dimensional regimes. By applying a Gaussian mechanism to the spectral projector and carrying out a sharp sensitivity analysis of eigenvectors and eigenvalues, the authors derive minimax upper bounds that hold across Schatten norms and match lower bounds obtained via a DP-Fano framework, up to polylogarithmic factors. The methodology privatizes eigenvectors and eigenvalues separately because their sensitivities differ, extends to sub-Gaussian distributions, and includes a private estimator of the nuisance variance based on bulk eigenvalues. Numerical experiments, including simulations and MNIST data, demonstrate favorable privacy-utility tradeoffs and validate the theoretical rates, showing robustness to rank and dimensionality and applicability when the sample size is smaller than the ambient dimension, provided the signal-to-noise ratio is sufficiently large.
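As a rough illustration of the pipeline summarized above, the following sketch draws Gaussian data from a spiked covariance model, forms the empirical spectral projector, perturbs it with symmetric Gaussian noise, and re-extracts the top-$r$ eigenvectors. It is only a sketch under stated assumptions: the function names are hypothetical, $\Lambda$ is taken to be `lam * I_r`, and `noise_std` is a free parameter rather than the sensitivity-calibrated noise level derived in the paper, so the code makes no privacy claim by itself.

```python
import numpy as np

def sample_spiked(n, p, r, lam, sigma2, rng):
    """Draw n Gaussian rows with covariance lam * U U^T + sigma2 * I_p."""
    # Random p x r orthonormal basis from a QR decomposition.
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))
    scores = rng.standard_normal((n, r)) * np.sqrt(lam)
    X = scores @ U.T + np.sqrt(sigma2) * rng.standard_normal((n, p))
    return X, U

def top_r_eigvecs(S, r):
    """Eigenvectors of symmetric S for its r largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)  # eigenvalues are returned in ascending order
    return vecs[:, -r:]

def noisy_projector_pca(X, r, noise_std, rng):
    """Perturb the empirical spectral projector, then re-extract r directions.

    noise_std is a hypothetical input; calibrating it to the projector's
    sensitivity is the step that a privacy analysis would have to supply.
    """
    n, p = X.shape
    S = X.T @ X / n                      # sample covariance (mean-zero data)
    U_hat = top_r_eigvecs(S, r)
    P = U_hat @ U_hat.T                  # rank-r spectral projector
    G = rng.standard_normal((p, p)) * noise_std
    Z = np.triu(G) + np.triu(G, 1).T     # symmetric Gaussian noise matrix
    return top_r_eigvecs(P + Z, r)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, U = sample_spiked(n=2000, p=50, r=3, lam=10.0, sigma2=1.0, rng=rng)
    U_tilde = noisy_projector_pca(X, r=3, noise_std=0.01, rng=rng)
    # Frobenius distance between the estimated and true projectors.
    print(np.linalg.norm(U_tilde @ U_tilde.T - U @ U.T))
```

Measuring error through the projector $UU^{\top}$ rather than $U$ itself sidesteps the rotation ambiguity of eigenvectors when eigenvalues coincide.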

Abstract

Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We propose computationally efficient differentially private estimators and prove their minimax optimality for sub-Gaussian distributions, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, compared to the existing literature, our results accommodate a diverging rank, a broader range of signal strengths, and remain valid even when the sample size is much smaller than the dimension, provided the signal strength is sufficiently strong. Both simulation studies and real data experiments demonstrate the merits of our method.
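The Schatten norms mentioned in the abstract can be made concrete with a short sketch. The Schatten-$q$ norm of a matrix is the $\ell_q$ norm of its singular values, so $q=\infty$, $q=2$, and $q=1$ recover the spectral, Frobenius, and nuclear norms respectively. The helper name `schatten_norm` is introduced here for illustration only.

```python
import numpy as np

def schatten_norm(A, q):
    """Schatten-q norm: the l_q norm of the singular values of A (q >= 1)."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(q):
        return s.max()                  # spectral norm
    return (s ** q).sum() ** (1.0 / q)  # q=2: Frobenius, q=1: nuclear

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 4))
    # Each pair below should agree up to floating-point error.
    print(schatten_norm(A, np.inf), np.linalg.norm(A, 2))
    print(schatten_norm(A, 2), np.linalg.norm(A, "fro"))
    print(schatten_norm(A, 1), np.linalg.norm(A, "nuc"))
```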

Paper Structure (53 sections, 22 theorems, 261 equations, 7 figures, 1 algorithm)

Key Result

Lemma 1

Let $X$ be a given data set and $X'$ be any neighboring data set of $X$, i.e., $X$ and $X'$ differ by at most one observation. The sensitivity of a function $f$ that maps $X$ into $\mathbb{R}^{d_1\times d_2}$ is defined by $\omega_f := \sup_{\text{neighboring } (X, X')} \|f(X) - f(X')\|_{\rm F}$. Then, for any $\varepsilon > 0$ and $\delta \in [0, 1)$, the randomized algorithm $A$ defined by $A(X)=f(X)+Z$ where $Z$ has i.i.d. $\mathcal{N}(0, 2\omega_f^2\varepsilon^{-2}
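A minimal sketch of a Gaussian mechanism of this form follows. Because the variance in the lemma statement above is cut off in this extract, the code falls back on the classical textbook calibration $\sigma = \omega_f \sqrt{2\log(1.25/\delta)}/\varepsilon$, whose standard guarantee is stated for $\varepsilon \in (0,1)$ and $\delta \in (0,1)$. That choice is an assumption and may differ from the lemma's exact constant. The function names are hypothetical, and `sensitivity` is taken to be the Frobenius-norm sensitivity of the released quantity.

```python
import math
import numpy as np

def gaussian_noise_std(sensitivity, eps, delta):
    """Noise level under the classical calibration (an assumption here)."""
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Release value + Z, where Z has i.i.d. N(0, std^2) entries."""
    value = np.asarray(value, dtype=float)
    std = gaussian_noise_std(sensitivity, eps, delta)
    return value + rng.standard_normal(value.shape) * std

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    f_X = np.eye(3)  # stand-in for a matrix-valued statistic f(X)
    print(gaussian_mechanism(f_X, sensitivity=0.1, eps=1.0, delta=0.1, rng=rng))
```

The noise scales linearly with the sensitivity, which is why a sharp sensitivity bound for the statistic translates directly into less added noise.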

Figures (7)

  • Figure 1: Comparison of our method, DP-Oja [liu_xiyang2022dp-pca], DP-Gauss, and DP-Gauss* [dwork2014analyze] in differentially private PCA with varying $n$ and $r$. The dimension is $p=50$, with $\lambda=10$, $\sigma^2=1$, and privacy constraints $\varepsilon=1$, $\delta=0.1$.
  • Figure 2: Comparison of our method, DP-Oja [liu_xiyang2022dp-pca], DP-Gauss, and DP-Gauss* [dwork2014analyze] in differentially private PCA with varying $\varepsilon$ and $\lambda$. The dimension is $p=50$, with $\sigma^2=1$ and privacy constraint $\delta=0.1$.
  • Figure 3: Comparison of our method, DP-Gauss, and DP-Gauss* [dwork2014analyze] in differentially private PCA when $p\geq n$ and the signal strength $\lambda$ varies. The dimension is $p=50$, with $n=30$, $r=3$, $\sigma^2=1$, and privacy constraint $\delta=0.1$.
  • Figure 4: Comparison of our method and DP-Gauss* [dwork2014analyze] in differentially private PCA on the MNIST dataset. The privacy constraints are $\varepsilon=2$ and $\delta=0.1$. The total sample size is $n=1500$. All images are downscaled to $14\times 14$.
  • Figure 5: Comparison of our method, DP-Gauss, and DP-Gauss* [dwork2014analyze] in differentially private covariance matrix estimation as the sample size $n$ varies. The rank is $r=3$, with $\lambda=10$, $\sigma^2=1$, and privacy constraints $\varepsilon=1$, $\delta=0.1$.
  • ...and 2 more figures

Theorems & Definitions (27)

  • Lemma 1: sensitivity and Gaussian mechanism
  • Lemma 2
  • Remark 1: Worst-case and high-probability privacy guarantee
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Theorem
  • Remark 2: Comparison with Oja's algorithm [liu_xiyang2022dp-pca]
  • Theorem
  • Remark 3: Comparison with [dwork2014analyze] and [mangoubi2022re]
  • ...and 17 more