Penalized Principal Component Analysis Using Smoothing
Rebecca M. Hurwitz, Georg Hahn
TL;DR
The paper addresses the challenge of computing sparse, interpretable eigenvectors in high-dimensional genomic settings by reframing PCA as a Penalized Eigenvalue Problem (PEP) with an $L_1$ penalty. It replaces the non-differentiable $L_1$ term with a differentiable, smoothed surrogate, which enables gradient-based optimization, and it obtains higher-order eigenvectors by deflation based on singular value decomposition. Through four experimental studies, including the 1000 Genomes Project, SARS-CoV-2 genome data, and the Iris benchmark, the authors show that smoothing improves numerical stability, the discernibility of clusterings, and predictive performance in polygenic risk score analyses, while keeping runtime competitive with state-of-the-art sparse PCA methods. The work provides an accessible R implementation (SPEV) and demonstrates broad applicability to genomics, clustering, and dimensionality reduction tasks that require sparse, interpretable eigenvectors.
Abstract
Principal components computed via principal component analysis (PCA) are traditionally used to reduce dimensionality in genomic data or to correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP), which reformulates the computation of the first eigenvector as an optimization problem and adds an $L_1$ penalty to enforce sparsity of the solution. The contribution of our article is threefold. First, we extend PEP by applying smoothing to the original LASSO-type $L_1$ penalty. This allows one to compute analytical gradients, which enable faster and more efficient minimization of the objective function associated with the optimization problem. Second, we demonstrate how higher-order eigenvectors can be calculated with PEP using established results from singular value decomposition (SVD). Third, we present four experimental studies to demonstrate the usefulness of the smoothed penalized eigenvectors. Using data from the 1000 Genomes Project, we empirically demonstrate that our proposed smoothed PEP increases numerical stability and yields meaningful eigenvectors. We also employ the penalized eigenvector approach in two additional real data applications (computation of a polygenic risk score and clustering), demonstrating that exchanging the penalized eigenvectors for their smoothed counterparts can increase prediction accuracy in polygenic risk scores and enhance the discernibility of clusterings. Moreover, we compare our proposed smoothed PEP to seven state-of-the-art algorithms for sparse PCA and evaluate the accuracy of the obtained eigenvectors, their support recovery, and their runtime.
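To make the idea of a smoothed penalty with analytical gradients concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the authors' SPEV implementation. The abstract does not give the exact objective, so the sketch assumes the form: minimize $-v^\top Q v + \lambda \sum_i f_\mu(v_i)$ over unit-norm $v$, where $f_\mu$ is a Huber-type smooth surrogate for $|v_i|$. The surrogate, the step size, the normalization step, and the Hotelling-type deflation are all choices made for this example, and the function names (`smoothed_abs`, `smoothed_pep`, `deflate`) are hypothetical.

```python
import numpy as np


def smoothed_abs(x, mu):
    """Huber-type smooth surrogate for |x| (an assumed smoothing choice).

    Quadratic on [-mu, mu], linear outside, and differentiable everywhere.
    """
    ax = np.abs(x)
    return np.where(ax <= mu, x**2 / (2.0 * mu), ax - mu / 2.0)


def smoothed_abs_grad(x, mu):
    """Analytical gradient of smoothed_abs: x/mu clipped to [-1, 1]."""
    return np.clip(x / mu, -1.0, 1.0)


def objective(v, Q, lam, mu):
    """Assumed smoothed objective: -v'Qv + lam * sum_i f_mu(v_i)."""
    return -v @ Q @ v + lam * smoothed_abs(v, mu).sum()


def smoothed_pep(Q, lam, mu=0.05, iters=500, seed=0):
    """Gradient descent on the assumed objective, renormalizing each step.

    Q is a symmetric positive semidefinite matrix (e.g. a covariance matrix).
    Returns a unit-norm vector; for lam = 0 the update reduces to a shifted
    power iteration toward the leading eigenvector.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Q.shape[0])
    v /= np.linalg.norm(v)
    # Step size from a Lipschitz bound on the gradient (illustrative choice).
    step = 1.0 / (2.0 * np.linalg.norm(Q, 2) + lam / mu + 1e-12)
    for _ in range(iters):
        grad = -2.0 * Q @ v + lam * smoothed_abs_grad(v, mu)
        v = v - step * grad
        norm = np.linalg.norm(v)
        if norm == 0.0:  # degenerate case: stop rather than divide by zero
            break
        v /= norm
    return v


def deflate(Q, v):
    """Hotelling-type deflation: remove the component of Q captured by v."""
    return Q - (v @ Q @ v) * np.outer(v, v)
```

A higher-order vector could then be obtained by calling `smoothed_pep(deflate(Q, v1), lam)` after computing `v1 = smoothed_pep(Q, lam)`. Smaller values of `mu` track the $L_1$ penalty more closely but make the gradient change faster near zero, which forces a smaller step size in this sketch.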
