Multiscale Grassmann Manifolds for Single-Cell Data Analysis

Xiang Xiang Wang, Sean Cottrell, Guo-Wei Wei

TL;DR

The paper addresses the challenge that conventional Euclidean representations struggle to capture the intrinsic non-Euclidean geometry and multiscale structure of single-cell data. It introduces a multiscale Grassmann manifolds (MGM) framework that embeds cells as subspaces on the Grassmann manifold $Gr(n,p)$ by aggregating multiple scale embeddings, and uses Grassmann-distance measures to form a global affinity for clustering. A power-based scale sampling function selects scales to balance local and global information, enabling robust, multiscale representations. Experiments across nine public scRNA-seq datasets show that MGM yields stable embeddings and competitive or superior clustering performance, especially for small to medium-sized datasets, highlighting the value of integrating multiscale geometric information on non-Euclidean manifolds.

Abstract

Single-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance information across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single-cell data.

Paper Structure

This paper contains 16 sections, 22 equations, 4 figures, 12 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of the proposed multiscale Grassmann manifolds (MGM) framework. Starting from a single-cell gene expression matrix, multiple low-dimensional embeddings are generated under different neighborhood sizes ($s_1,\ldots,s_p$) using a chosen dimensionality reduction method. For each cell, the resulting multiscale feature vectors are aggregated into a matrix whose column space defines a subspace on the Grassmann manifold $\mathbf{Gr}(n,p)$. Distances between these subspaces are then assembled into a pairwise distance matrix, which can be used for downstream analyses such as clustering.
  • Figure 2: Average clustering performance across all datasets under the noisy condition (Setup I), comparing MGM, Avg-UMAP, and PCA. Panels correspond to four evaluation metrics: ACC, NMI, ARI, and Purity.
  • Figure 3: Average clustering performance under the refined condition (Setup II), comparing MGM, NMF, rNMF, Avg-UMAP, and PCA across five evaluation metrics (ACC, NMI, ARI, Purity, and Avg-Purity). Each panel shows the mean score of one metric across all datasets.
  • Figure 4: Visualization under UMAP (MGM vs. PCA$\rightarrow$UMAP vs. UMAP). Columns correspond to datasets (GSE67835, GSE75748time, GSE109979, GSE75748cell, GSE94820), and rows correspond to methods. Top row: MGM, where chordal distances are computed from multiscale subspace representations and then embedded using UMAP with a precomputed metric. Middle row: PCA$\rightarrow$UMAP, obtained by applying UMAP to PCA-reduced features using the Euclidean metric. Bottom row: UMAP, obtained directly from the preprocessed data without PCA or Grassmann manifold representation.
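
The pipeline summarized in Figures 1 and 4 (per-cell multiscale matrix, column-space subspace, pairwise chordal distances) can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that $Gr(n,p)$ denotes $p$-dimensional subspaces of $\mathbb{R}^n$, that each of the $p$ scales yields an $n$-dimensional embedding per cell, and that each per-cell matrix has full column rank. The function names `cell_subspaces` and `chordal_distance_matrix` are hypothetical, and random numbers stand in for the real multiscale embeddings.

```python
import numpy as np


def cell_subspaces(embeddings):
    """Map multiscale embeddings of shape (p, N, n) to orthonormal bases (N, n, p).

    p = number of scales, N = number of cells, n = embedding dimension.
    (Hypothetical helper; the shape convention is an assumption.)
    """
    p, N, n = embeddings.shape
    if p > n:
        raise ValueError("need p <= n for a p-dimensional subspace of R^n")
    # Per cell: an n x p matrix whose columns are that cell's embedding at each scale.
    mats = np.transpose(embeddings, (1, 2, 0))
    # Reduced QR yields an orthonormal basis of the column space
    # (assuming full column rank).
    return np.stack([np.linalg.qr(m)[0] for m in mats])


def chordal_distance_matrix(bases):
    """Pairwise chordal distances between subspaces given as (N, n, p) bases."""
    # cross[i, j] = U_i^T U_j, a p x p matrix for every pair of cells.
    cross = np.einsum("iab,jac->ijbc", bases, bases)
    # Singular values of U_i^T U_j are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(cross, compute_uv=False), 0.0, 1.0)
    # Chordal distance: sqrt(sum_k sin^2(theta_k)).
    dist = np.sqrt(np.sum(1.0 - cosines**2, axis=-1))
    np.fill_diagonal(dist, 0.0)  # remove round-off on the diagonal
    return dist


# Placeholder data: random values stand in for real multiscale embeddings.
rng = np.random.default_rng(0)
p, N, n = 3, 6, 5
embeddings = rng.normal(size=(p, N, n))

bases = cell_subspaces(embeddings)
D = chordal_distance_matrix(bases)
```

The resulting matrix `D` is the kind of precomputed distance input that the Figure 4 caption describes passing to UMAP, or that a clustering routine accepting precomputed distances could consume. Forming all pairs at once needs memory on the order of $N^2 p^2$, so a blocked loop would be more appropriate for large datasets.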

Theorems & Definitions (5)

  • Definition 2.1: Principal Angles
  • Definition 2.2: Subspace Representation
  • Remark 3.1
  • Remark 4.1
  • Remark 4.2
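
For orientation, the standard textbook form of principal angles is given below. Definition 2.1 presumably follows it, but the paper's exact statement is not reproduced on this page, so the notation here ($U$, $V$, $\sigma_i$, $d_c$) is an assumption. For two $p$-dimensional subspaces of $\mathbb{R}^n$ with orthonormal basis matrices $U, V \in \mathbb{R}^{n \times p}$, the principal angles $0 \le \theta_1 \le \cdots \le \theta_p \le \pi/2$ satisfy

$$\cos\theta_i = \sigma_i\!\left(U^\top V\right), \qquad i = 1,\ldots,p,$$

where $\sigma_1 \ge \cdots \ge \sigma_p$ are the singular values of $U^\top V$. The chordal distance mentioned in the Figure 4 caption is then commonly written as

$$d_c(U,V) = \Big(\sum_{i=1}^{p} \sin^2\theta_i\Big)^{1/2} = \frac{1}{\sqrt{2}}\,\big\lVert UU^\top - VV^\top \big\rVert_F .$$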