Penalized Principal Component Analysis Using Smoothing

Rebecca M. Hurwitz, Georg Hahn

TL;DR

The paper addresses the challenge of computing sparse and interpretable eigenvectors in high-dimensional genomic settings by reframing PCA as a Penalized Eigenvalue Problem (PEP) with an $L_1$ penalty. It introduces a differentiable, smoothed surrogate for the non-differentiable $L_1$ term, enabling gradient-based optimization, and uses deflation via singular value decomposition to obtain higher-order eigenvectors. Through four experimental studies, including the 1000 Genomes Project, SARS-CoV-2 genome data, and the Iris benchmark, the authors show that smoothing improves numerical stability, clustering discernibility, and predictive performance in polygenic risk score analyses, while maintaining competitive runtime against state-of-the-art sparse PCA methods. The work provides an accessible R implementation (SPEV) and demonstrates broad applicability in genomics, clustering, and dimensionality reduction tasks requiring sparse, interpretable eigenvectors.

Abstract

Principal components computed via PCA (principal component analysis) are traditionally used to reduce dimensionality in genomic data or to correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP) which reformulates the computation of the first eigenvector as an optimization problem and adds an $L_1$ penalty constraint to enforce sparseness of the solution. The contribution of our article is threefold. First, we extend PEP by applying smoothing to the original LASSO-type $L_1$ penalty. This allows one to compute analytical gradients which enable faster and more efficient minimization of the objective function associated with the optimization problem. Second, we demonstrate how higher order eigenvectors can be calculated with PEP using established results from singular value decomposition (SVD). Third, we present four experimental studies to demonstrate the usefulness of the smoothed penalized eigenvectors. Using data from the 1000 Genomes Project dataset, we empirically demonstrate that our proposed smoothed PEP allows one to increase numerical stability and obtain meaningful eigenvectors. We also employ the penalized eigenvector approach in two additional real data applications (computation of a polygenic risk score and clustering), demonstrating that exchanging the penalized eigenvectors for their smoothed counterparts can increase prediction accuracy in polygenic risk scores and enhance discernibility of clusterings. Moreover, we compare our proposed smoothed PEP to seven state-of-the-art algorithms for sparse PCA and evaluate the accuracy of the obtained eigenvectors, their support recovery, and their runtime.
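The smoothing and deflation ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' SPEV implementation, and several pieces are assumptions made for illustration: the objective is taken to be $v^\top A v - \lambda \sum_i f_\mu(v_i)$ maximized over unit vectors $v$ for a symmetric matrix $A$; the surrogate $f_\mu$ is a Huber-type function (one common outcome of Nesterov-style smoothing, which may differ from the surrogate used in the paper); the optimizer is plain projected gradient ascent; and higher-order vectors are obtained by Hotelling deflation rather than the SVD-based result the paper relies on. All function names, the step size, and the iteration count are hypothetical.

```python
import math

def smoothed_abs(x, mu):
    """Huber-type surrogate for |x| (assumed form): quadratic on [-mu, mu], linear outside."""
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0

def smoothed_abs_grad(x, mu):
    """Derivative of smoothed_abs; unlike sign(x), it is continuous at x = 0."""
    return max(-1.0, min(1.0, x / mu))

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def smoothed_pep(A, lam, mu, steps=2000, lr=0.01):
    """Projected gradient ascent on v'Av - lam * sum_i smoothed_abs(v_i, mu), with ||v||_2 = 1.

    Assumes A is symmetric, so the gradient of v'Av is 2Av.
    """
    n = len(A)
    v = normalize([1.0 + 0.1 * i for i in range(n)])  # arbitrary deterministic start
    for _ in range(steps):
        Av = matvec(A, v)
        grad = [2.0 * Av[i] - lam * smoothed_abs_grad(v[i], mu) for i in range(n)]
        v = normalize([v[i] + lr * grad[i] for i in range(n)])
    return v

def deflate(A, v):
    """Hotelling deflation (an assumption here): remove the component of A along unit vector v."""
    Av = matvec(A, v)
    rho = sum(x * y for x, y in zip(v, Av))  # Rayleigh quotient v'Av
    n = len(A)
    return [[A[i][j] - rho * v[i] * v[j] for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    A = [[2.0, 0.1], [0.1, 1.0]]
    v1 = smoothed_pep(A, lam=1.0, mu=0.1)            # first penalized vector
    v2 = smoothed_pep(deflate(A, v1), lam=1.0, mu=0.1)  # next vector after deflation
    print(v1, v2)
```

With `lam=0` the update reduces to a shifted power iteration and recovers the ordinary leading eigenvector; raising `lam` pushes small entries toward zero, while `mu` controls how closely the surrogate follows $|x|$ near the origin.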

Paper Structure

This paper contains 20 sections, 14 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Population stratification for the 1000 Genomes Project data set using unsmoothed PEP of eq. \ref{eq:pep}, evaluated at $\lambda \in \{0,1,10,100\}$.
  • Figure 2: Population stratification for the 1000 Genomes Project data set using smoothed PEP of eq. \ref{eq:smoothed_pep}, evaluated at $\lambda \in \{0,1,10,100\}$.
  • Figure 3: Population stratification for the 1000 Genomes Project data set using smoothed PEP of eq. \ref{eq:smoothed_pep} with Lasso penalty $\lambda=1$. The smoothing parameter $\mu \in \{10^{-2},10^{-1},1,10\}$ is varied and color-coded: black ($\mu = 0.01$), blue ($\mu = 0.1$), green ($\mu = 1$), and red ($\mu = 10$). Both axes are on a log scale, computed as $\log(1+x)$ and $\text{sign}(y) \log(1+|y|)$, respectively.
  • Figure 4: AUC metric for the prediction of SARS-CoV-2 mortality as a function of the proportion of the data used for training.
  • Figure 5: Iris benchmark dataset of the UC Irvine Machine Learning Repository. Clustering computed with unsmoothed (left) and smoothed (right) PEP. Points are colored by their true label (the name of the species).
  • ...and 2 more figures