Sparse PCA with False Discovery Rate Controlled Variable Selection

Jasin Machkour, Arnaud Breloy, Michael Muma, Daniel P. Palomar, Frédéric Pascal

TL;DR

This work addresses the limitation of variance-driven sparse PCA by introducing a false discovery rate–controlled approach. By embedding the elastic net SPCA within the Terminating-Random Experiments (T-Rex) selector, it automatically yields sparse loading supports with provable FDR control, eliminating the need for sparsity parameter tuning. The method constructs FDR-controlled loading sets for each principal component, derives loading vectors via ridge regression on selected features, and forms sparse, interpretable PCs that capture signal with minimal contamination from noise. Empirical results on synthetic data and real stock-return data show improved interpretability and effective variance explanation, highlighting practical benefits for high-dimensional data analysis.

Abstract

Sparse principal component analysis (PCA) aims to map high-dimensional data to a lower-dimensional linear subspace. By constraining the loading vectors to be sparse, it performs the double duty of dimension reduction and variable selection. Sparse PCA algorithms are usually expressed as a trade-off between explained variance and sparsity of the loading vectors (i.e., the number of selected variables). Because a high explained variance is not necessarily synonymous with relevant information, these methods are prone to selecting irrelevant variables. To overcome this issue, we propose an alternative formulation of sparse PCA driven by the false discovery rate (FDR). We then leverage the Terminating-Random Experiments (T-Rex) selector to automatically determine an FDR-controlled support of the loading vectors. A major advantage of the resulting T-Rex PCA is that no sparsity parameter tuning is required. Numerical experiments and a stock market data example demonstrate a significant performance improvement.
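
The step of deriving a loading vector by ridge regression on the selected features can be illustrated with a minimal sketch. It assumes the regression target is the ordinary PC score vector, takes the support as given (the T-Rex selector itself is not implemented here), and uses a hypothetical function name, ridge penalty, and synthetic data that do not come from the paper.

```python
import numpy as np

def sparse_loading_from_support(X, support, pc_index=0, ridge=1e-3):
    """Ridge-regress an ordinary PC score vector on the selected columns only.

    `support` is assumed to come from some variable selector; `ridge` is an
    illustrative penalty value, not one prescribed by the paper.
    """
    Xc = X - X.mean(axis=0)                       # center each variable
    # Ordinary PCA via SVD: the rows of Vt are the dense loading vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    z = Xc @ Vt[pc_index]                         # ordinary PC scores
    Xs = Xc[:, support]                           # keep selected variables only
    # Closed-form ridge solution: (Xs^T Xs + ridge * I)^{-1} Xs^T z
    gram = Xs.T @ Xs + ridge * np.eye(len(support))
    beta = np.linalg.solve(gram, Xs.T @ z)
    loading = np.zeros(X.shape[1])
    loading[support] = beta                       # zeros outside the support
    norm = np.linalg.norm(loading)
    return loading / norm if norm > 0 else loading

# Synthetic example: the first three variables share one latent factor.
rng = np.random.default_rng(0)
n, p = 200, 10
factor = rng.normal(size=(n, 1))
X = rng.normal(scale=0.1, size=(n, p))
X[:, :3] += factor
support = np.array([0, 1, 2])                     # hypothetical selector output
v = sparse_loading_from_support(X, support)
```

The resulting vector `v` has unit norm and is exactly zero outside the supplied support, so the corresponding sparse PC `X @ v` depends only on the selected variables.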

Paper Structure

This paper contains 8 sections, 12 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: Simplified T-Rex selector framework [machkour2021terminating, machkour2022TRexGVS].
  • Figure 2: For the first PC, the proposed T-Rex PCA methods empirically control the FDR at a level of $10$% while achieving an optimal TPR of $100$% even at low SNRs. Only the infeasible oracle thresholded PCA achieves the same TPR at an FDR of almost zero. Except for high SNRs, the oracle SPCA is dominated by all other methods.
  • Figure 3: Cumulative percentage of explained variance (PEV). (a)-(c): As desired, the proposed T-Rex PCA and T-Rex Thresholded PCA require only very few PCs to explain the signal and mixed variance, without explaining any additional variance that is purely associated with null loadings. The oracle SPCA is outperformed by all other methods, and the ordinary PCA explains all the variance in the data, including the variance that is merely associated with null loadings. (d): The cumulative PEV is not very sensitive to the choice of the target FDR level for the T-Rex PCA, which allows the user to set almost any (preferably low) target FDR and still achieve a high cumulative PEV.
  • Figure 4: Correlation matrices of the $28$ most influential stocks (according to their index weights) in the S&P $500$ index.
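
The FDR and TPR quoted in the Figure 2 caption are averages of per-run proportions. The sketch below uses the standard definitions (false discovery proportion and true positive proportion, averaged over runs); it is not a reproduction of the paper's Definition 1, and the function name and toy runs are made up for illustration.

```python
def fdp_tpp(selected, true_support):
    """Return (false discovery proportion, true positive proportion) for one run."""
    selected, true_support = set(selected), set(true_support)
    false_pos = len(selected - true_support)      # selected but actually null
    true_pos = len(selected & true_support)       # selected and truly active
    fdp = false_pos / max(1, len(selected))       # defined as 0 if nothing is selected
    tpp = true_pos / max(1, len(true_support))    # defined as 0 if there are no actives
    return fdp, tpp

# Toy Monte Carlo runs: (selected support, true support) pairs, invented here.
runs = [
    ([0, 1, 2], [0, 1, 2]),       # perfect recovery
    ([0, 1, 2, 7], [0, 1, 2]),    # one false discovery
    ([0, 1], [0, 1, 2]),          # one missed variable
]
results = [fdp_tpp(sel, truth) for sel, truth in runs]
fdr_hat = sum(f for f, _ in results) / len(results)   # empirical FDR
tpr_hat = sum(t for _, t in results) / len(results)   # empirical TPR
```

Controlling the FDR at a target level such as 10% means that the expected FDP stays at or below that level; any single run can still have a higher FDP.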

Theorems & Definitions (1)

  • Definition 1