Constructing Cell-type Taxonomy by Optimal Transport with Relaxed Marginal Constraints
Sebastian Pena, Lin Lin, Jia Li
TL;DR
The paper addresses cross-sample cell-type annotation in scRNA-seq by constructing a taxonomy of clusters across multiple samples while accommodating missing and new cell types. It introduces Multisample OT Taxonomy (MOTT), which combines Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and simultaneous alignment to compute cross-sample cluster similarities. These similarities form an overall similarity matrix $\mathbf{B}$, which is converted to a distance matrix $\mathbf{A}$ via $-\log(\cdot)$ and then partitioned into meta-clusters by hierarchical clustering with Ward linkage. Each cluster is modeled as a Gaussian $N(\mu_k^{(i)}, \Sigma_k^{(i)})$, and distances between clusters are measured by the squared Wasserstein distance $D_W^2$. OT-RMC yields a matching $\mathbf{W}^*$ with induced cluster proportions and allows clusters to remain unmatched through a gap penalty $\lambda L(\mathbf{g})$. Empirically, MOTT achieves higher taxonomy accuracy and better sample-level classification than baselines on more than twenty datasets, supporting robust cross-sample labeling without pooling or explicit batch correction. Code is released.
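As a rough illustration of two ingredients named in the summary, the sketch below computes the closed-form squared 2-Wasserstein distance between two Gaussian cluster models and turns a similarity matrix into meta-clusters via $-\log(\cdot)$ followed by Ward linkage. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`psd_sqrt`, `gaussian_w2_squared`, `meta_clusters_from_similarity`), the clipping constant `eps`, and the toy inputs are hypothetical, and the OT-RMC matching that would actually produce $\mathbf{B}$ is not reproduced here. It also assumes that $D_W^2$ denotes the standard 2-Wasserstein distance between Gaussians, and it passes the $-\log$ distances directly to SciPy's Ward linkage, which is strictly defined only for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    sym = (mat + mat.T) / 2.0              # guard against round-off asymmetry
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)        # drop tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2).

    Uses the closed form
    ||mu1 - mu2||^2 + tr(cov1 + cov2 - 2 (cov1^{1/2} cov2 cov1^{1/2})^{1/2}).
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + max(cov_term, 0.0))


def meta_clusters_from_similarity(B, n_meta, eps=1e-12):
    """Convert similarities in (0, 1] to distances with -log, then cut a
    Ward-linkage dendrogram into n_meta groups (hypothetical helper)."""
    B = np.asarray(B, float)
    A = -np.log(np.clip(B, eps, 1.0))      # distance matrix A = -log(B)
    A = (A + A.T) / 2.0                    # enforce symmetry
    np.fill_diagonal(A, 0.0)               # zero self-distance
    Z = linkage(squareform(A, checks=False), method="ward")
    return fcluster(Z, t=n_meta, criterion="maxclust")
```

For example, calling `meta_clusters_from_similarity` on a 4-by-4 similarity matrix whose first two and last two clusters are mutually similar, with `n_meta=2`, groups them into two meta-clusters. In practice the number of meta-clusters, or the height at which the dendrogram is cut, would have to be chosen by whatever criterion the paper specifies.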
Abstract
The rapid emergence of single-cell data has facilitated the study of many different biological conditions at the cellular level. Cluster analysis has been widely applied to identify cell types, capturing the essential patterns of the original data in a much more concise form. One challenge in the cluster analysis of cells is matching clusters extracted from datasets of different origins or conditions. Many existing algorithms cannot recognize new cell types present in only one of the two samples when establishing a correspondence between clusters obtained from two samples. Additionally, when there are more than two samples, it is advantageous to align clusters across all samples simultaneously rather than performing pairwise alignment. Our approach aims to construct a taxonomy for cell clusters across all samples to better annotate these clusters and effectively extract features for downstream analysis. A new system for constructing cell-type taxonomy has been developed by combining the technique of Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and the simultaneous alignment of clusters across multiple samples. OT-RMC allows us to address challenges that arise when the proportions of clusters vary substantially between samples or when some clusters do not appear in all the samples. Experiments on more than twenty datasets demonstrate that the taxonomy constructed by this new system can yield highly accurate annotation of cell types. Additionally, sample-level features extracted based on the taxonomy result in accurate classification of samples.
