Inference for Heteroskedastic PCA with Missing Data

Yuling Yan, Yuxin Chen, Jianqing Fan

TL;DR

This paper proposes a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022), and develops non-asymptotic distributional guarantees for HeteroPCA.

Abstract

This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels.
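To make the estimator named in the abstract more concrete, the following is a minimal sketch of a HeteroPCA-style iteration (diagonal imputation of the Gram matrix) applied to zero-filled, inverse-probability-weighted data. It illustrates the general idea only and is not the paper's algorithm listing: the function names `top_r_eig` and `hetero_pca`, the plug-in estimate `p_hat` of the observation rate, and the default iteration count are all assumptions made for this example.

```python
import numpy as np


def top_r_eig(G, r):
    """Top-r eigenvectors (by eigenvalue magnitude) of a symmetric matrix G,
    together with the corresponding rank-r approximation of G."""
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(np.abs(vals))[::-1][:r]
    U = vecs[:, idx]
    return U, (U * vals[idx]) @ U.T


def hetero_pca(Y_obs, mask, r, n_iter=50):
    """Sketch of a HeteroPCA-style subspace estimate.

    Y_obs : (d, n) data matrix; values at unobserved positions are ignored.
    mask  : (d, n) boolean array, True where an entry is observed.
    r     : target rank of the principal subspace.
    """
    d, n = Y_obs.shape
    p_hat = mask.mean()                       # assumed plug-in observation rate
    Y = np.where(mask, Y_obs, 0.0) / p_hat    # zero-fill and rescale
    G = (Y @ Y.T) / n                         # sample Gram matrix
    np.fill_diagonal(G, 0.0)                  # discard the noise-contaminated diagonal
    for _ in range(n_iter):
        _, G_r = top_r_eig(G, r)
        # Impute the diagonal from the current rank-r fit; off-diagonals stay fixed.
        np.fill_diagonal(G, np.diag(G_r))
    U, _ = top_r_eig(G, r)
    return U
```

The diagonal of the Gram matrix is where heteroskedastic noise variances accumulate, which is why this sketch never trusts the observed diagonal and re-estimates it from the low-rank fit in each pass.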

Paper Structure

This paper contains 161 sections, 43 theorems, 662 equations, 6 figures, 2 tables, 5 algorithms.

Key Result

Theorem 1

Assume that each column of the ground truth $\bm{X}$ (cf. eq:definition-X-matrix) is independently generated from $\mathcal{N}(\bm{0},\bm{S}^\star)$, and that the sampling set $\Omega$ follows the random sampling model in Section subsec:intro-model. Suppose that $p<1-\delta$ for some arbitrary const... Suppose, in addition, that the number of iterations exceeds ... Let $\bm{R}$ be the $r\times r$ rotati...

Figures (6)

  • Figure 1: The relative estimation errors of $\bm{U}$ and $\bm{S}$ returned by both the SVD-based approach (cf. Algorithm alg:PCA-SVD) and HeteroPCA (cf. Algorithm alg:PCA-HeteroPCA) across different noise levels $\omega^{\star}$. (a) Relative estimation errors of $\bm{U}\bm{R}-\bm{U}^{\star}$ measured by $\Vert\cdot\Vert$, $\Vert\cdot\Vert_{\mathrm{F}}$ and $\Vert\cdot\Vert_{2,\infty}$ vs. the noise level $\omega^{\star}$; (b) Relative estimation errors of $\bm{S}-\bm{S}^{\star}$ measured by $\Vert\cdot\Vert$, $\Vert\cdot\Vert_{\mathrm{F}}$ and $\Vert\cdot\Vert_{\infty}$ vs. the noise level $\omega^{\star}$. The results are reported over $200$ independent trials for $r=3$ and $p=0.6$.
  • Figure 2: The relative estimation errors of $\bm{U}$ and $\bm{S}$ returned by both the SVD-based approach (cf. Algorithm alg:PCA-SVD) and HeteroPCA (cf. Algorithm alg:PCA-HeteroPCA) across different values of the missing probability $p$. (a) Relative estimation errors of $\bm{U}\bm{R}-\bm{U}^{\star}$ measured by $\Vert\cdot\Vert$, $\Vert\cdot\Vert_{\mathrm{F}}$ and $\Vert\cdot\Vert_{2,\infty}$ vs. the missing rate $p$; (b) Relative estimation errors of $\bm{S}-\bm{S}^{\star}$ measured by $\Vert\cdot\Vert$, $\Vert\cdot\Vert_{\mathrm{F}}$ and $\Vert\cdot\Vert_{\infty}$ vs. the missing rate $p$. The results are reported over $200$ independent trials for $r=3$ and $\omega^{\star}=0.05$.
  • Figure 3: The relative estimation errors of $\bm{U}$ and $\bm{S}$ returned by both the diagonal-deleted spectral method cai2019subspace and HeteroPCA (cf. Algorithm alg:PCA-HeteroPCA). (a) Relative estimation error $\Vert\bm{U}\bm{R}-\bm{U}^{\star}\Vert/\Vert\bm{U}^{\star}\Vert$ vs. the dimension $d$; (b) Relative estimation error $\Vert\bm{S}-\bm{S}^{\star}\Vert/\Vert\bm{S}^{\star}\Vert$ vs. the dimension $d$. The results are reported over $200$ independent trials for $r=3$, $\omega^{\star}=0.05$ and $p=0.6$.
  • Figure 4: (a) Q-Q (quantile-quantile) plot of $T_{1}$ vs. the standard normal distribution for the SVD-based approach; (b) Q-Q (quantile-quantile) plot of $T_{1}$ vs. the standard normal distribution for HeteroPCA. The results are reported over $2000$ independent trials for $r=1$, $p=0.6$ and $\omega^{\star}=0.05$.
  • Figure 5: (a) Q-Q (quantile-quantile) plot of $Z_{1,1}$ vs. the standard normal distribution for the SVD-based approach; (b) Q-Q (quantile-quantile) plot of $Z_{1,2}$ vs. a standard Gaussian distribution for the SVD-based approach. The results are reported over $2000$ independent trials for $r=3$, $p=0.6$, $\omega^{\star}=0.05$.
  • ...and 1 more figure
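The relative errors of $\bm{U}\bm{R}-\bm{U}^{\star}$ reported in the figure captions above can be computed along the following lines. This is a hedged sketch: the definition of $\bm{R}$ is truncated in the theorem excerpt, so the code assumes the common choice of the orthogonal Procrustes solution (the orthogonal matrix minimizing $\Vert\bm{U}\bm{R}-\bm{U}^{\star}\Vert_{\mathrm{F}}$). The helper names `procrustes_rotation`, `two_to_infty_norm`, and `relative_errors` are hypothetical.

```python
import numpy as np


def procrustes_rotation(U, U_star):
    """Orthogonal R minimizing ||U R - U_star||_F (assumed choice of R)."""
    W, _, Vt = np.linalg.svd(U.T @ U_star)
    return W @ Vt


def two_to_infty_norm(M):
    """The 2,infinity norm: the largest Euclidean norm among the rows of M."""
    return np.max(np.linalg.norm(M, axis=1))


def relative_errors(U, U_star):
    """Relative errors of U R - U_star in the three norms named in the captions."""
    R = procrustes_rotation(U, U_star)
    E = U @ R - U_star
    return {
        "spectral": np.linalg.norm(E, 2) / np.linalg.norm(U_star, 2),
        "frobenius": np.linalg.norm(E, "fro") / np.linalg.norm(U_star, "fro"),
        "two_to_infty": two_to_infty_norm(E) / two_to_infty_norm(U_star),
    }
```

Aligning by a rotation before measuring the error matters because the columns of an estimated basis are only identifiable up to an orthogonal transformation within the subspace.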

Theorems & Definitions (83)

  • Definition 1: Incoherence
  • Remark 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • Theorem 3
  • Remark 3
  • Theorem 4
  • Remark 4
  • Theorem 5
  • ...and 73 more