FeatPCA: A feature subspace based principal component analysis technique for enhancing clustering of single-cell RNA-seq data

Md Romizul Islam, Swakkhar Shatabda

TL;DR

FeatPCA tackles clustering of high-dimensional scRNA-seq data with $n$ cells and $d$ genes by partitioning the feature space into $k$ subspaces, applying PCA within each, and merging embeddings to obtain a final $n\times\sum_i m_i$ representation for clustering. The method uses a denoising autoencoder to impute missing values and compares four subspace-generation strategies, reporting improvements in ARI across seven datasets compared with full-data PCA and with state-of-the-art tools SC3, Seurat, and FEATS. The results demonstrate that subspace PCA often yields higher ARI than traditional full-data PCA, with sequential subspacing frequently providing the strongest gains (e.g., Yan and Pollen datasets). These findings suggest a scalable, modular workflow for single-cell analysis that can be extended to multimodal data and other clustering pipelines.
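
The shape of the merged representation follows directly from the description above. As a sketch, and assuming the $k$ subspaces are disjoint column blocks that together cover all $d$ genes (the symbols $X_i$, $Z_i$, and $d_i$ are introduced here for illustration only), the construction can be written as:

$$
X = [\,X_1 \mid X_2 \mid \cdots \mid X_k\,], \qquad X_i \in \mathbb{R}^{n \times d_i}, \qquad \sum_{i=1}^{k} d_i = d
$$

$$
Z_i = \mathrm{PCA}_{m_i}(X_i) \in \mathbb{R}^{n \times m_i}, \qquad Z = [\,Z_1 \mid Z_2 \mid \cdots \mid Z_k\,] \in \mathbb{R}^{n \times \sum_i m_i}
$$

For instance, with hypothetical values $k = 4$ and $m_i = 10$ for every subspace, the merged embedding $Z$ has $4 \times 10 = 40$ columns regardless of $d$.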

Abstract

Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to analyze gene expression at the cellular level. By measuring gene expression in each individual cell, scRNA-seq generates large datasets with thousands of genes. Handling such high-dimensional data poses computational challenges, which makes dimensionality reduction crucial for scRNA-seq analysis. Dimensionality reduction algorithms such as Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbor Embedding (t-SNE) are commonly used to address this challenge. These methods transform the original high-dimensional data into a lower-dimensional representation while preserving relevant information. In this paper we propose FeatPCA. Instead of applying dimensionality reduction directly to the entire dataset, we divide it into multiple subspaces. Within each subspace, we apply dimension reduction techniques, and then merge the reduced data. FeatPCA offers four variations for subspacing. Our experimental results demonstrate that clustering based on subspacing yields better accuracy than working with the full dataset. Across a variety of scRNA-seq datasets, FeatPCA consistently outperforms existing state-of-the-art clustering tools.
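
The divide, reduce, and merge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pca_embed` and `subspace_pca`, the use of the same number of components `m` in every subspace, and the synthetic count matrix are all assumptions made for illustration. Only contiguous (sequential) and shuffled splits are shown; the autoencoder-based imputation and the final clustering step are omitted.

```python
import numpy as np

def pca_embed(block, n_components):
    """Project the rows of `block` onto its top principal components via SVD."""
    centered = block - block.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def subspace_pca(X, k, m, shuffle=False, seed=0):
    """Split the d gene columns into k blocks, run PCA per block, concatenate.

    Hypothetical helper: shuffle=False gives contiguous blocks, shuffle=True
    permutes the gene order before splitting.
    """
    n, d = X.shape
    cols = np.arange(d)
    if shuffle:
        cols = np.random.default_rng(seed).permutation(d)
    blocks = np.array_split(cols, k)
    # Cap the component count so it never exceeds the block's rank bound.
    parts = [pca_embed(X[:, idx], min(m, len(idx), n)) for idx in blocks]
    return np.hstack(parts)

# Synthetic stand-in for an expression matrix: 60 cells x 200 genes.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(60, 200)).astype(float)

Z = subspace_pca(X, k=4, m=5)                 # contiguous blocks
Zs = subspace_pca(X, k=4, m=5, shuffle=True)  # shuffled gene order
print(Z.shape, Zs.shape)
```

The merged matrix `Z` has one row per cell and `k * m` columns, and it can be passed to any clustering routine in place of a full-data PCA embedding.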

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Pipeline of the FeatPCA algorithm
  • Figure 2: ARI value for sequential subspacing
  • Figure 3: ARI value for shuffled sequential subspacing
  • Figure 4: ARI value for randomly selected genes
  • Figure 5: ARI value based on performing gene clustering