Unsupervised Evolutionary Cell Type Matching via Entropy-Minimized Optimal Transport

Mu Qiao

TL;DR

This work addresses the unsupervised identification of evolutionary correspondences between cell types across species. It introduces OT-MESH, which combines entropy-regularized optimal transport with the Minimize Entropy of Sinkhorn (MESH) refinement to produce sparse, interpretable cross-species cell-type mappings. Cell types are represented by centroids computed over SNR-selected genes, and the transport cost between species is based on cosine distance between those centroids. OT-MESH reaches accuracy comparable to other OT-based methods such as RefCM while running faster, and it matches or outperforms baselines on synthetic scalability tests and on retinal bipolar cell (BC) and retinal ganglion cell (RGC) datasets from mouse and macaque. The method is robust to noise and recovers known cross-species homologies as well as novel ones, one of which was independently validated experimentally, which makes it practical for large-scale comparative genomics and evolutionary cell biology.

Abstract

Identifying evolutionary correspondences between cell types across species is a fundamental challenge in comparative genomics and evolutionary biology. Existing approaches often rely on either reference-based matching, which imposes asymmetry by designating one species as the reference, or projection-based matching, which may increase computational complexity and obscure biological interpretability at the cell-type level. Here, we present OT-MESH, an unsupervised computational framework leveraging entropy-regularized optimal transport (OT) to systematically determine cross-species cell type homologies. Our method uniquely integrates the Minimize Entropy of Sinkhorn (MESH) technique to refine the OT plan, transforming diffuse transport matrices into sparse, interpretable correspondences. Through systematic evaluation on synthetic datasets, we demonstrate that OT-MESH achieves near-optimal matching accuracy with computational efficiency, while maintaining remarkable robustness to noise. Compared to other OT-based methods like RefCM, OT-MESH provides speedup while achieving comparable accuracy. Applied to retinal bipolar cells (BCs) and retinal ganglion cells (RGCs) from mouse and macaque, OT-MESH accurately recovers known evolutionary relationships and uncovers novel correspondences, one of which was independently validated experimentally. Thus, our framework offers a principled, scalable, and interpretable solution for evolutionary cell type mapping, facilitating deeper insights into cellular specialization and conservation across species.

Paper Structure

This paper contains 42 sections, 17 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the OT-MESH framework for cross-species cell type matching. Starting from cell type centroids computed using SNR-selected genes, a cost matrix is constructed based on cosine distances between centroids from two species. Standard entropy-regularized OT yields an initial correspondence matrix $\mathbf{W}$ that is typically diffuse and difficult to interpret biologically. The MESH procedure iteratively refines the cost matrix through entropy minimization, transforming the diffuse transport plan into a sparse, interpretable correspondence matrix $\mathbf{W}^*$ that clearly identifies evolutionarily related cell types between species.
  • Supplementary Figure 1: Parameter selection of OT-MESH, illustrated by the example of the correspondence between macaque peripheral and foveal BC types. A) Entropy versus MESH learning rate ($\lambda$) for $\alpha = 1.0$, showing curves for different numbers of MESH iterations ($T$). The asterisk marks the elbow point at $\lambda = 5.0$, where further increases in learning rate yield diminishing returns in entropy reduction. B) Entropy versus MESH iterations ($T$) for $\alpha = 1.0$, showing curves for different learning rates. The asterisk indicates the elbow point at $T = 8$, beyond which additional iterations provide minimal entropy reduction. C) Transport cost versus regularization parameter ($\alpha$) evaluated at the elbow points identified for each $\alpha$ value. The asterisk marks the optimal parameter combination ($\alpha = 1.0$, $\lambda = 5.0$, $T = 8$) that minimizes transport cost while maintaining high sparsity. This systematic approach ensures reproducible parameter selection that balances biological interpretability (through sparsity) with fidelity to the underlying gene expression similarities (through transport cost minimization).
  • Figure 2: Scalability analysis across varying numbers of cell types. A) Runtime scaling of different methods. B) Matching accuracy (ARI) over cell type number of different methods. C) Solution entropy scaling of different methods. Different methods are indicated in the legend of panel C. Error bars represent standard deviation across three independent runs.
  • Supplementary Figure 2: Robustness analysis across varying noise levels. A) Matching accuracy (ARI) over noise level of different methods. B) Solution entropy over noise level of different methods. Different methods are indicated in the legend of panel A. Error bars represent standard deviation across three independent runs.
  • Figure 3: Comparison of correspondence matrices for macaque peripheral to foveal BC type mapping. A) XGBoost shows diagonal structure with notable off-diagonal noise. B) Harmony+1NN improves alignment with cleaner structure. C) Standard OT produces a diffuse, uninterpretable mapping. D) OT-MESH yields perfect diagonal structure with high-confidence 1:1 matchings without requiring prior constraints. E) RefCM-Strict achieves clean diagonal through enforced 1-to-1 constraints. Correspondence matrices are normalized so that the sum of all their entries is one.
  • ...and 2 more figures
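
The captions above describe MESH as iteratively refining the cost matrix by entropy minimization, with a learning rate $\lambda$ and an iteration count $T$. One reading of that description is gradient descent on the cost matrix, $C \leftarrow C - \lambda \nabla_C H(\mathrm{Sinkhorn}(C))$, repeated $T$ times. The sketch below follows that reading under stated assumptions: the toy cost matrix, the uniform marginals, and the function names are hypothetical, and a central finite-difference gradient stands in for the automatic differentiation a practical implementation would likely use. The parameter values echo the caption of Supplementary Figure 1 but are used here only as an example.

```python
import math

def sinkhorn(C, alpha, n_iter=200):
    """Entropy-regularized OT plan with uniform marginals (assumed)."""
    n, m = len(C), len(C[0])
    r, c = 1.0 / n, 1.0 / m
    K = [[math.exp(-C[i][j] / alpha) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [r / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def entropy(W):
    """Shannon entropy of a plan whose entries sum to one."""
    return -sum(w * math.log(w) for row in W for w in row if w > 0)

def entropy_grad(C, alpha, eps=1e-4):
    """Central finite-difference gradient of H(sinkhorn(C)) w.r.t. each cost entry.

    Stand-in for autodiff; only practical for small matrices.
    """
    n, m = len(C), len(C[0])
    G = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            orig = C[i][j]
            C[i][j] = orig + eps
            h_plus = entropy(sinkhorn(C, alpha))
            C[i][j] = orig - eps
            h_minus = entropy(sinkhorn(C, alpha))
            C[i][j] = orig                      # restore the entry
            G[i][j] = (h_plus - h_minus) / (2 * eps)
    return G

def mesh(C, alpha=1.0, lam=5.0, T=8):
    """Run T gradient steps on a copy of C to lower plan entropy; return the refined plan."""
    C = [row[:] for row in C]
    for _ in range(T):
        G = entropy_grad(C, alpha)
        C = [[C[i][j] - lam * G[i][j] for j in range(len(C[0]))]
             for i in range(len(C))]
    return sinkhorn(C, alpha)

# Toy cosine-style cost between 3 cell types per species (hypothetical values).
C0 = [[0.03, 0.62, 0.96],
      [0.77, 0.02, 0.77],
      [0.77, 0.97, 0.02]]

W0 = sinkhorn(C0, alpha=1.0)        # standard OT plan: diffuse at this alpha
W_star = mesh(C0, alpha=1.0, lam=5.0, T=8)
print("entropy before MESH:", round(entropy(W0), 4))
print("entropy after MESH: ", round(entropy(W_star), 4))
```

The refinement leaves the Sinkhorn solver untouched and changes only its input, so the refined plan still satisfies the same marginal constraints while concentrating mass on fewer entries.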