scE2TM improves single-cell embedding interpretability and reveals cellular perturbation signatures

Hegang Chen, Yuyin Lu, Yifan Zhao, Zhiming Dai, Fu Lee Wang, Qing Li, Yanghui Rao, Yue Li

TL;DR

This work tackles interpretability in single-cell embeddings with an external knowledge-guided embedded topic model built on two key innovations: a cross-view encoder that incorporates foundation-model knowledge, and embedding clustering regularization that prevents topic collapse. The authors establish a quantitative interpretability framework and demonstrate state-of-the-art clustering across 20 scRNA-seq datasets, along with superior topic diversity and pathway relevance. Through pancreas, interferon-perturbed PBMC, and melanoma case studies, scE2TM uncovers biologically meaningful topics, enables in silico perturbations that recapitulate real responses, and shows clinical relevance via TCGA survival associations. Together, these results position scE2TM as a robust, interpretable tool for mechanistic insight and potential therapeutic target discovery in single-cell genomics.

Abstract

Single-cell RNA sequencing technologies have revolutionized our understanding of cellular heterogeneity, yet computational methods often struggle to balance performance with biological interpretability. Embedded topic models have been widely used for interpretable single-cell embedding learning. However, these models can suffer from interpretation collapse, where topics semantically collapse towards each other, resulting in redundant topics and incomplete capture of biological variation. Furthermore, the rise of single-cell foundation models creates opportunities to harness external biological knowledge for guiding model embeddings. Here, we present scE2TM, an external knowledge-guided embedded topic model that provides high-quality cell embeddings and interpretation for scRNA-seq analysis. Through an embedding clustering regularization method, each topic is constrained to be the center of a separately aggregated gene cluster, enabling it to capture unique biological information. Across 20 scRNA-seq datasets, scE2TM achieves superior clustering performance compared with seven state-of-the-art methods. A comprehensive interpretability benchmark further shows that scE2TM-learned topics exhibit higher diversity and stronger consistency with underlying biological pathways. Modeling interferon-stimulated PBMCs, scE2TM simulates topic perturbations that drive control cells toward stimulated-like transcriptional states, faithfully mirroring experimental interferon responses. In melanoma, scE2TM identifies malignant-specific topics and extrapolates them to unseen patient data, revealing gene programs associated with patient survival.
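To make the idea of "each topic as the center of a separately aggregated gene cluster" concrete, the following is a minimal sketch of a soft gene-to-topic assignment computed with entropic optimal transport (Sinkhorn iterations). It is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean cost, the uniform marginals, the value of `eps`, and the names `sinkhorn_assign` and `ecr_loss` are all hypothetical choices made here.

```python
import math


def sinkhorn_assign(genes, topics, eps=0.5, n_iter=200):
    """Soft-assign gene embeddings (samples) to topic embeddings (centers).

    Assumptions of this sketch: squared Euclidean cost and uniform marginals,
    so every gene carries mass 1/n and every topic receives mass 1/k.
    Returns the transport plan and the cost matrix.
    """
    n, k = len(genes), len(topics)
    # cost[j][c]: squared distance between gene j and topic c.
    cost = [[sum((a - b) ** 2 for a, b in zip(g, t)) for t in topics] for g in genes]
    # Gibbs kernel of the entropic problem.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(kernel[j][c] * v[c] for c in range(k)) for j in range(n)]
        v = [(1.0 / k) / sum(kernel[j][c] * u[j] for j in range(n)) for c in range(k)]
    plan = [[u[j] * kernel[j][c] * v[c] for c in range(k)] for j in range(n)]
    return plan, cost


def ecr_loss(plan, cost):
    """Transport cost under the plan; minimizing it pulls genes toward their topic."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


# Toy data (hypothetical): two well-separated gene groups and two topics.
genes = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 2.0]]
topics = [[0.0, 0.0], [2.0, 2.0]]
plan, cost = sinkhorn_assign(genes, topics)
print(plan)
print(ecr_loss(plan, cost))
```

Because every topic must receive an equal share of the gene mass in this sketch, no topic can be left empty and two topics cannot both claim the same genes, which is the intuition behind preventing collapsed, redundant topics.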

Paper Structure

This paper contains 34 sections, 22 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Schematic overview of scE$^2$TM. (a) Cross-view encoder. This encoder integrates single-cell expression data with embeddings extracted from a single-cell foundation model. To combine these two views, cluster and topic heads are trained on mutual neighborhood information by encouraging consistent clustering assignments among the mutual nearest neighbors of the corresponding cells. (b) Embedding clustering regularization (ECR) module. ECR clusters gene embeddings $\mathbf{g}_{j}$ ($\textcolor{myblue}{\bullet}$) as samples and topic embeddings $\mathbf{t}_{k}$ ($\textcolor{myred}{\star}$) as centers with soft assignment $\pi_{\epsilon, j k}^{*}$. For instance, ECR pushes $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$ close to $\mathbf{t}_{1}$ and away from $\mathbf{t}_{3}$ and $\mathbf{t}_{5}$. (c) Sparse linear decoder. The decoder learns topic embeddings and gene embeddings as well as sparse topic-gene dependencies during reconstruction, thereby ensuring model interpretability.
  • Figure 2: Cell-type clustering benchmark. (a) Cell-type clustering performance evaluation in terms of Adjusted Rand Index (ARI). The 20 panels on the left show the ARI values for the 8 benchmark methods on 20 datasets. The panel on the right displays the average ARI values and standard deviations. The statistical significance between scE$^2$TM and the second-best method was tested by pairwise Mann-Whitney U-test. Results for scTAG and scLEGA on some of the large scRNA-seq datasets (Karagiannis, Orozco, Hrvatin, and Schaum_tmuris) are not shown because of their limited scalability. (b) UMAP visualization on the Usoskin dataset. UMAP was performed on the embeddings from each benchmark method on the Usoskin dataset. Cell types include peptidergic nociceptors (PEP), non-peptidergic nociceptors (NP), neurofilament (NF), and tyrosine hydroxylase (TH). (c) Cell-type clustering accuracy. The percentages of accuracy improvement achieved by scE$^2$TM relative to the baselines are labeled on the corresponding bar plots.
  • Figure 3: Interpretability metrics comparison. Quantitative assessment of interpretability using six metrics including Interpretation Purity (IP), Topic Diversity (TD), Topic Coherence (TC), Topic Quality (TQ), GSEA quality (GSEA$_Q$), and ORA quality (ORA$_Q$) across 20 scRNA-seq datasets. The percentage gain of scE$^2$TM over the second best method was marked on each panel.
  • Figure 4: Correlation between interpretability and clustering metrics. Pairwise scatter plots in the lower triangle illustrate the relationships between each pair of metrics. Each point corresponds to the metric scores of one model-dataset pair, covering five single-cell embedded topic models applied to twenty datasets. The kernel density along the diagonal estimates their marginal distributions. Pearson correlation coefficients ($r$) and their statistical significance are summarized in the accompanying heatmap in the upper triangle. Each cell reports the $r$ value together with its significance level ($p$ < 0.05: $*$, $p$ < 0.01: $**$, $p$ < 0.001: $***$).
  • Figure 5: Analysis of scE$^2$TM topic and gene embeddings on human pancreas scRNA-seq data. (a) Topic and gene embedding. UMAP visualization shows the global distribution of genes (${\bullet}$) and topics (${\star}$) in the learned embedding space. (b) Visualization of the embeddings for topics 12 and 69 and their top-$10$ genes. (c) Gene Set Enrichment Analysis (GSEA) of topics 12 and 69. Leading-edge analysis was performed on the "REACTOME_INSULIN_PROCESSING" pathway using topic 12. The running sum enrichment score is calculated by GSEA. (d) GSEA analysis of topic 69 on the "VANGURP_PANCREATIC_BETA_CELL" pathway. (e) Topic-gene similarity. Comparison of cosine similarity between each topic and its top genes.
  • ...and 3 more figures
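
Two of the metrics named in the captions above, the Adjusted Rand Index (Figure 2) and Topic Diversity (Figure 3), are simple enough to sketch directly. The ARI code follows the standard pair-counting formula. The Topic Diversity code assumes the commonly used definition (the fraction of unique genes among the top genes of all topics), which may differ from the exact definition used in the paper. The function names and toy inputs are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for a degenerate input
    # Count co-occurring (true, predicted) label pairs and the two marginals.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single cluster each)
    return (sum_ij - expected) / (max_index - expected)


def topic_diversity(top_genes_per_topic):
    """Fraction of unique genes among the top genes of all topics (assumed definition)."""
    all_genes = [gene for topic in top_genes_per_topic for gene in topic]
    return len(set(all_genes)) / len(all_genes)


# Toy inputs (hypothetical): cluster labels and top-2 gene lists for two topics.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(topic_diversity([["INS", "IAPP"], ["INS", "GCG"]]))
```

ARI is invariant to how cluster labels are numbered, so a relabeled but otherwise identical partition still scores 1. A Topic Diversity near 1 means topics share few top genes, while a low value signals the redundant topics that interpretation collapse produces.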