Constructing Cell-type Taxonomy by Optimal Transport with Relaxed Marginal Constraints

Sebastian Pena, Lin Lin, Jia Li

TL;DR

The paper addresses cross-sample cell-type annotation in scRNA-seq by constructing a taxonomy of clusters across multiple samples, accommodating cell types that are missing from some samples or new to others. It introduces Multisample OT Taxonomy (MOTT), which combines Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and Simultaneous Alignment to compute cross-sample cluster similarities. These similarities form an overall similarity matrix $\mathbf{B}$, which is converted to a distance matrix $\mathbf{A}$ via $-\log(\cdot)$ and then partitioned into meta-clusters with Ward linkage. Each cluster is modeled as a Gaussian $N(\mu_k^{(i)}, \Sigma_k^{(i)})$ (cluster $k$ of sample $i$), and distances between clusters are measured by the squared Wasserstein distance $D_W^2$. OT-RMC yields a matching $\mathbf{W}^*$ with induced cluster proportions and allows clusters to remain unmatched through a gap penalty $\lambda L(\mathbf{g})$. Empirically, MOTT achieves higher taxonomy accuracy and better sample-level classification than baselines on 11 datasets, demonstrating robust cross-sample labeling without pooling or explicit batch correction. The code is released.
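As a rough illustration of two ingredients named above, the following Python sketch computes the standard closed-form squared 2-Wasserstein distance between two Gaussians and applies the $-\log(\cdot)$ map to a similarity matrix. It is not the authors' released code: the function names, the `eps` floor, the assumption that similarities lie in $(0, 1]$, and the zeroed diagonal are all choices made here for illustration.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    mat = 0.5 * (mat + mat.T)            # symmetrize against round-off
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)      # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2).

    Closed form: ||mu1 - mu2||^2
                 + tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2}).
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    root2 = psd_sqrt(cov2)
    cross = psd_sqrt(root2 @ cov1 @ root2)
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    cov_term = float(np.trace(cov1 + cov2 - 2.0 * cross))
    return mean_term + max(cov_term, 0.0)  # cov_term is >= 0 up to round-off


def similarity_to_distance(B, eps=1e-12):
    """Map similarities to distances with -log.

    Assumptions (not from the paper): entries of B lie in (0, 1], values
    below eps are floored to avoid log(0), and the diagonal is set to 0.
    """
    B = np.asarray(B, float)
    A = -np.log(np.clip(B, eps, 1.0))
    np.fill_diagonal(A, 0.0)
    return A


if __name__ == "__main__":
    # Toy inputs, not taken from any dataset in the paper.
    print(gaussian_w2_squared([0, 0], np.eye(2), [3, 4], np.eye(2)))
    print(similarity_to_distance([[1.0, 0.5], [0.5, 1.0]]))
```

The resulting distance matrix could then be passed, in condensed form, to a Ward-linkage routine such as `scipy.cluster.hierarchy.linkage(..., method="ward")` and cut into the desired number of meta-clusters. How MOTT implements that step in detail is not specified in this summary.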

Abstract

The rapid emergence of single-cell data has facilitated the study of many different biological conditions at the cellular level. Cluster analysis has been widely applied to identify cell types, capturing the essential patterns of the original data in a much more concise form. One challenge in the cluster analysis of cells is matching clusters extracted from datasets of different origins or conditions. Many existing algorithms cannot recognize new cell types present in only one of the two samples when establishing a correspondence between clusters obtained from two samples. Additionally, when there are more than two samples, it is advantageous to align clusters across all samples simultaneously rather than performing pairwise alignment. Our approach aims to construct a taxonomy for cell clusters across all samples to better annotate these clusters and effectively extract features for downstream analysis. A new system for constructing cell-type taxonomy has been developed by combining the technique of Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and the simultaneous alignment of clusters across multiple samples. OT-RMC allows us to address challenges that arise when the proportions of clusters vary substantially between samples or when some clusters do not appear in all the samples. Experiments on more than twenty datasets demonstrate that the taxonomy constructed by this new system can yield highly accurate annotation of cell types. Additionally, sample-level features extracted based on the taxonomy result in accurate classification of samples.

Paper Structure

This paper contains 14 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A schematic representation of the Multisample OT Taxonomy (MOTT) system. The taxonomy hierarchically organizes clusters across samples, identifying which clusters correspond to the same cell type and illustrating the relationships between different cell types based on their similarity.
  • Figure 2: t-SNE plots showing data grouped by ground truth cell types (left column) and by sample (right column). (a) Zhao Spleen dataset, (b) Baron Pancreas dataset. The Zhao Spleen dataset, which only includes cells from the spleen, was used instead of the entire Zhao dataset because the latter contains over 60,000 points, making it too large for effective visualization. The composition of cell types in different samples, both in terms of existing cell types and their population sizes, varies substantially. This observation highlights the challenge of constructing a taxonomy for cell clusters across samples.
  • Figure 3: An example taxonomy created by MOTT for the Segerstolpe dataset. The Segerstolpe dataset includes ten samples and, collectively, 11 cell types. Accordingly, we use the taxonomy to generate 11 meta-clusters. Each meta-cluster is labeled according to the most common cell type among its constituent clusters. Beneath each meta-cluster, we show its chosen cell type on the first line and, on the second line, the number of clusters in the meta-cluster and the percentage of those clusters whose cell type matches the chosen label. "Endoth" stands for Endothelial, "Co-exp." stands for Co-expression, and "MHC II" stands for MHC Class II. Only two meta-clusters, Epsilon and MHC Class II, include clusters whose cell types differ from the chosen label. The horizontal dashed line indicates the cut-off level at which eleven meta-clusters are formed.
  • Figure 4: Accuracy achieved by four methods for identifying cell types based on the taxonomy. OT-RMC-SA corresponds to the MOTT system. The accuracy is measured by ARI, cluster-level accuracy $\zeta_{cls}$, and cell-level accuracy $\zeta_{cell}$. "SS" stands for simulated samples. The number of simulated samples in a dataset is 20 by default unless specified in parentheses. If "SL" is indicated, the samples were simulated by randomly dividing cells within each cell condition separately; otherwise, the samples were generated by randomly dividing the entire data. "Z. Human" stands for Zilionis Human and "Z. Mouse" stands for Zilionis Mouse. Some datasets have two sets of ground truth labels: "coarse" indicates major cell types and "fine" indicates cell subtypes.
  • Figure 5: Boxplots showing the differences in performance between pairs of real sample and simulated sample datasets. Five datasets are analyzed: Segerstolpe, Zhao (fine cell types), Zilionis Human (fine cell types), He Organ, and Baron. Each simulated dataset (generated by the SS scheme) consists of 20 samples, except for the Zilionis Human dataset, which consists of 40 samples. Performance is evaluated using three metrics: ARI, cluster-level accuracy $\zeta_{cls}$, and cell-level accuracy $\zeta_{cell}$. A positive difference indicates better performance on the simulated sample datasets. In nearly all cases across different dataset pairs and methods, the simulated sample datasets show superior performance compared to their real sample counterparts. Notably, OT-RMC-RA and OT-RMC-SA exhibit smaller medians and ranges, indicating that these methods are less sensitive to how samples are generated.
  • ...and 1 more figure
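The labeling rule described in the Figure 3 caption (each meta-cluster takes the most common cell type among its clusters, and the share of clusters matching that label is reported) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the input layout, and the tie-breaking behavior (first type encountered wins) are assumptions, and the example data are made up.

```python
from collections import Counter, defaultdict


def label_meta_clusters(meta_ids, cell_types):
    """Label each meta-cluster by its most common cluster cell type.

    meta_ids[i]:   meta-cluster assigned to cluster i
    cell_types[i]: ground-truth cell type of cluster i
    Returns {meta_id: (label, n_clusters, percent_matching_label)}.
    Ties are broken by first occurrence (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for meta_id, cell_type in zip(meta_ids, cell_types):
        groups[meta_id].append(cell_type)

    summary = {}
    for meta_id, types in groups.items():
        label, count = Counter(types).most_common(1)[0]
        summary[meta_id] = (label, len(types), 100.0 * count / len(types))
    return summary


if __name__ == "__main__":
    # Toy assignment of five clusters to two meta-clusters (illustrative only).
    meta_ids = [0, 0, 0, 1, 1]
    cell_types = ["Epsilon", "Epsilon", "Endoth", "MHC II", "MHC II"]
    for meta_id, (label, n, pct) in sorted(
        label_meta_clusters(meta_ids, cell_types).items()
    ):
        print(f"meta-cluster {meta_id}: {label} ({n} clusters, {pct:.1f}% matching)")
```

In this toy input, the first meta-cluster is labeled Epsilon even though one of its three clusters is Endothelial, mirroring the kind of mixed meta-cluster the caption mentions.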