Sparse PCA with False Discovery Rate Controlled Variable Selection
Jasin Machkour, Arnaud Breloy, Michael Muma, Daniel P. Palomar, Frédéric Pascal
TL;DR
Variance-driven sparse PCA is prone to selecting irrelevant variables, because high explained variance does not necessarily indicate relevant information. This work addresses that limitation with an approach that controls the false discovery rate (FDR). By embedding the elastic net SPCA within the Terminating-Random Experiments (T-Rex) selector, it automatically yields sparse loading supports with provable FDR control, so no sparsity parameter needs to be tuned. The method constructs an FDR-controlled loading set for each principal component, derives the loading vectors via ridge regression on the selected features, and forms sparse, interpretable PCs that capture signal with minimal contamination from noise. Empirical results on synthetic data and real stock-return data show improved interpretability and effective variance explanation, highlighting practical benefits for high-dimensional data analysis.
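The final step of the pipeline summarized above (ridge regression on the selected features) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sparse_loading_from_support`, the penalty value, and the toy data are assumptions made for illustration, and the support is hard-coded rather than produced by the T-Rex selector, so the sketch carries no FDR guarantee.

```python
import numpy as np

def sparse_loading_from_support(X, support, ridge_penalty=1e-2, component=0):
    """Hypothetical helper: ridge-regress an ordinary PC score on selected columns."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each variable
    loading = np.zeros(X.shape[1])
    idx = np.asarray(support, dtype=int)
    if idx.size == 0:
        return loading                           # nothing selected: all-zero loading

    # Ordinary PCA via SVD: rows of Vt are the dense loading vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[component]                  # dense PC scores (regression target)

    Xs = Xc[:, idx]                              # keep the selected variables only
    # Closed-form ridge solution: (Xs'Xs + lambda * I)^{-1} Xs' scores
    gram = Xs.T @ Xs + ridge_penalty * np.eye(idx.size)
    beta = np.linalg.solve(gram, Xs.T @ scores)

    norm = np.linalg.norm(beta)
    if norm > 0:
        loading[idx] = beta / norm               # unit-norm sparse loading vector
    return loading


# Toy data (assumed): variables 0-2 share one latent factor, 3-7 are pure noise.
rng = np.random.default_rng(0)
n, p = 200, 8
factor = rng.normal(size=(n, 1))
X = 0.3 * rng.normal(size=(n, p))
X[:, :3] += 2.0 * factor

support = [0, 1, 2]   # stand-in for a selector's output; NOT computed by T-Rex here
v = sparse_loading_from_support(X, support)
sparse_pc = (X - X.mean(axis=0)) @ v             # scores of the sparse component
```

Taking the support as an input keeps variable selection and loading estimation separate, so any selector could be plugged in ahead of the ridge step.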
Abstract
Sparse principal component analysis (PCA) aims at mapping high-dimensional data to a linear subspace of lower dimension. By constraining the loading vectors to be sparse, it performs the double duty of dimension reduction and variable selection. Sparse PCA algorithms are usually expressed as a trade-off between explained variance and sparsity of the loading vectors (i.e., the number of selected variables). Because a high explained variance is not necessarily synonymous with relevant information, these methods are prone to selecting irrelevant variables. To overcome this issue, we propose an alternative formulation of sparse PCA driven by the false discovery rate (FDR). We then leverage the Terminating-Random Experiments (T-Rex) selector to automatically determine an FDR-controlled support of the loading vectors. A major advantage of the resulting T-Rex PCA is that no sparsity parameter tuning is required. Numerical experiments and a stock market data example demonstrate a significant performance improvement.
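For reference, the FDR mentioned above is commonly defined as the expected fraction of false discoveries among all selected variables. The notation below is an assumption made for illustration and may differ from the paper's: $\mathcal{A}$ denotes the true support of a loading vector, $\widehat{\mathcal{A}}$ the selected support, and $\alpha \in [0, 1]$ the target FDR level.

```latex
\mathrm{FDR}
  = \mathbb{E}\!\left[
      \frac{\bigl|\widehat{\mathcal{A}} \setminus \mathcal{A}\bigr|}
           {\max\bigl(1, \bigl|\widehat{\mathcal{A}}\bigr|\bigr)}
    \right]
  \le \alpha
```

The $\max(1, \cdot)$ in the denominator sets the ratio to zero when no variable is selected.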
