Dimension estimation in PCA model using high-dimensional data augmentation
Una Radojicic, Joni Virta
TL;DR
The paper tackles latent-dimension estimation in PCA for high-dimensional data. It first analyzes predictor augmentation and shows that the original approach can be inconsistent when both the data dimension and the augmentation dimension grow with the sample size. It then proposes high-dimensional predictor augmentation (HDPA), which debiases the spike eigenvalues of the original data, adjusts the eigenvector-norm information accordingly, and identifies the latent dimension by a jump in a carefully constructed criterion; consistency is proved under mild conditions. The authors derive the limiting behavior of the augmented eigenstructure in high dimensions and show that, unlike the original method, HDPA remains reliable across a broad range of $\gamma_p$ and $\gamma_r$ regimes. Simulations demonstrate substantial improvements over competing methods, including robustness to non-Gaussian data, and the paper also offers practical guidance on noise-variance estimation and augmentation tuning.
Abstract
We propose a modified, high-dimensional version of a recent dimension estimation procedure that determines the dimension by introducing augmented noise variables into the data. Our asymptotic results show that the proposal is consistent in a wide range of high-dimensional scenarios, and further shed light on why the original method breaks down when the dimension of either the data or the augmentation becomes too large. Simulations demonstrate the superiority of the proposal over competitors both under and outside of the theoretical model.
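To make the augmentation idea concrete, the following is a minimal sketch of a low-dimensional predictor-augmentation estimator in the spirit of the original method that the paper modifies. It is not the HDPA procedure itself: it applies no eigenvalue debiasing and no high-dimensional adjustment. The function name `augmentation_dimension_estimate`, the specific criterion (cumulative squared norm of the augmented block of the leading eigenvectors plus a scaled-eigenvalue term), and all parameter values are assumptions made for illustration, and the noise variance is taken as known.

```python
import numpy as np

def augmentation_dimension_estimate(X, r=10, n_draws=5, sigma2=1.0, seed=0):
    """Illustrative (hypothetical) augmentation-based estimate of the latent dimension."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Eigenvalues of the sample covariance of the original data, largest first.
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

    # Average, over independent augmentations, the squared norm of the
    # augmented block (last r coordinates) of each leading eigenvector.
    aug_norms = np.zeros(p)
    for _ in range(n_draws):
        S = rng.normal(scale=np.sqrt(sigma2), size=(n, r))  # artificial noise columns
        cov_aug = np.cov(np.hstack([X, S]), rowvar=False)
        _, vecs = np.linalg.eigh(cov_aug)
        vecs = vecs[:, ::-1]                                # largest eigenvalue first
        aug_norms += np.sum(vecs[p:, :p] ** 2, axis=0)
    aug_norms /= n_draws

    # Criterion over candidate dimensions k = 0, ..., p - 1 (assumed form):
    #   f(k)   = sum of the first k augmented-block norms (increases once k > d),
    #   phi(k) = lambda_{k+1} / (1 + lambda_1 + ... + lambda_{k+1}) (drops once k >= d).
    f = np.concatenate([[0.0], np.cumsum(aug_norms)])[:p]
    phi = evals / (1.0 + np.cumsum(evals))
    g = f + phi
    return int(np.argmin(g)), g

# Toy example: n = 2000 observations, p = 10 variables, d = 3 latent directions
# with spike sizes 50, 30, 20 and unit noise variance (all values are assumptions).
rng = np.random.default_rng(1)
n, p, d = 2000, 10, 3
Q, _ = np.linalg.qr(rng.normal(size=(p, d)))        # orthonormal loadings
A = Q * np.sqrt(np.array([50.0, 30.0, 20.0]))
X = rng.normal(size=(n, d)) @ A.T + rng.normal(size=(n, p))

d_hat, g = augmentation_dimension_estimate(X)
print("estimated dimension:", d_hat)
```

The sketch relies on the fact that signal eigenvectors place almost no mass on the artificial noise coordinates, whereas noise eigenvectors spread across them. In this toy setting $p$ and $r$ are small relative to $n$; the regime in which both grow with $n$ is where, according to the paper, the original approach can fail and the proposed corrections become necessary.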
