Table of Contents
Fetching ...

Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data

Victor Chardès

TL;DR

This work introduces a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution, and enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free.

Abstract

Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences and technical factors, such as amplification bias and limited RNA capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness, in spite of its known bias in high dimensions. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.

Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data

TL;DR

This work introduces a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution, and enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free.

Abstract

Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences and technical factors, such as amplification bias and limited RNA capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness, in spite of its known bias in high dimensions. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening algorithm which self-consistently estimates the magnitude of transcriptomic noise affecting each gene in individual cells, without assuming a specific noise distribution. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.

Paper Structure

This paper contains 20 sections, 23 equations, 18 figures, 2 algorithms.

Figures (18)

  • Figure 1: Validity of the separable covariance model. The factors $A$ and $B$ are estimated using our Sinkhorn-Knopp Biwhitening algorithm. a. Numerical solution for the density $\rho_S$ from Eq. \ref{['sol_stieljes']} and Eq. \ref{['fpoint_g2']} using empirical estimators for $\rho_A$ and $\rho_B$. b. Eigenspectrum of the covariance matrix after biwhitening, shown alongside the Marchenko-Pastur distribution (dashed red line). The inset highlights the spectral edge, with outlier eigenvalues in pink. The close agreement with the Marchenko-Pastur prediction is supported by the small Kolmogorov-Smirnov (KS) distance and large $p$-value luecken2021sandbox. c. KS distance between the covariance eigenspectrum after biwhitening (red), gene-wise $z$-scoring (red), gene-wise $z$-scoring followed by cell-wise $z$-scoring (blue), and gene-wise $z$-scoring followed by cell-wise $L_2$ normalization (orange). All datasets were processed using $2500$ highly variable genes, following the flow charts in Fig. \ref{['fig:figSpipe']} (see SI Appendix \ref{['sec:reproducibility']}) For a and b, the dataset used is Luecken2021.
  • Figure 2: Noise reduction obtained with sparse PCA.a. Noise reduction as a function of $\gamma/\gamma^*$ for five different sparse PCA algorithms. For AManPG we use two $L_2$ penalties in its elastic-net regularization: no penalty (red) and a penalty of $10^4$ (green). For our FISTA implementation we compare two orthogonalization methods: Gram–Schmidt (brown) and Löwdin (blue). In pink, we apply sklearn algorithm after gene-wise z-scoring (whitening) instead of biwhitening. b. Error reduction at $\gamma = 0.6 \gamma^*$ for all datasets using the top-performing methods, with an average reduction of $\sim 30\%$. The colors correspond to the legend in a. For all datasets we used $2000$ highly variable genes, and preprocessing before biwhitening/whitening and sparse PCA followed the same steps as in Fig. \ref{['fig:fig1']}.
  • Figure 3: Comparison of cell type annotation performance for state-of-the-art methods. Improvement over PCA in out-of-bag error for bagging of $30$$k$-NN classifiers with network weights inversely proportional to the distance between points and $n=25$ neighbors. The methods are ranked from best (top) to worst (bottom), as measured across classifiers using constant weights, inverse-distance weights, UMAP weights, and $n=10, 25, 45$ neighbors. To further reduce the variance of the classification error, we exclude from the analysis cell types represented by fewer than $30$ cells. The best-performing method (grey) uses all cells in the dataset ($\geq 30000$) to compute PCs, which are then used to project the subset of cells used in the benchmark. The next best performers are all RMT-guided sparse PCA methods. We use $3000$ cells and $2000$ highly variable genes. Complete benchmark flow charts for each dataset are shown in Fig. \ref{['fig:figSpipe_zheng']}-\ref{['fig:figSpipe_stuart']}. Each experiment was repeated $10$ times with a different cell subset selected at random. Benchmark results for other types of $k$-NN classifiers are shown in Fig. \ref{['fig:figSknn']}.
  • Figure : Sinkhorn-Knopp Biwhitening
  • Figure S1: Outlier eigenvalues for mixtures of information-plus-noise and separable spiked covariance models. In this figure we use $n = 3180$ and $p = 3990$. We define a vector $u_1 \in \mathbb{R}^p$ with $u_{1,1} \approx 0.24$, $u_{1,2} \approx 0.97$, and all other entries zero. For the correlated mixture, we define a vector $u_2 \in \mathbb{R}^p$ with $u_{2,1} \approx 0.92$, $u_{2,2} \approx 0.39$, and all other entries zero, while for independent mixtures $u_2$ has entries chosen uniformly at random. We also define a vector $u_3 \in \mathbb{R}^n$ with entries chosen uniformly at random. All vectors are normalized. The matrices $A$ and $B$ are diagonal. We take $B$ with $60\%$ of its entries equal to $12$ and the rest equal to $1$. The low-rank signals are defined as $Q = 45 u_1 u_1^T$ and $P = 2.5 u_3 u_2^T$. The data is generated as $X = A^{1/2} Y (B + Q)^{1/2} + P$ with entries of $Y$ being independent standard normal variables. The same realization $Y$ is used across all four subfigures. a, b,$30\%$ of the entries of $A$ are set to $8$ and the rest to $0.1$. The signal eigenvalues $\alpha$ are computed as the isolated eigenvalues of $(B+Q)$. c, d, $A$ is the identity matrix. The signal eigenvalues $\alpha$ are computed as the isolated eigenvalues of $\mathbb{E}[S] = I + Q + P^TP/n$. a, c. Independent mixture. b, d. Correlated mixture.
  • ...and 13 more figures