GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics

Dominik Klein, Théo Uscidda, Fabian Theis, Marco Cuturi

TL;DR

This approach learns stochastic maps, allows for any cost function, relaxes mass conservation constraints and integrates quadratic solvers to tackle the complex challenges posed by the (Fused) Gromov-Wasserstein problem, illustrating significant potential for enhancing therapeutic strategies.

Abstract

Single-cell genomics has significantly advanced our understanding of cellular behavior, catalyzing innovations in treatments and precision medicine. However, single-cell sequencing technologies are inherently destructive and can only measure a limited array of data modalities simultaneously. This limitation underscores the need for new methods capable of realigning cells. Optimal transport (OT) has emerged as a potent solution, but traditional discrete solvers are hampered by scalability, privacy, and out-of-sample estimation issues. These challenges have spurred the development of neural network-based solvers, known as neural OT solvers, that parameterize OT maps. Yet, these models often lack the flexibility needed for broader life science applications. To address these deficiencies, our approach learns stochastic maps (i.e. transport plans), allows for any cost function, relaxes mass conservation constraints and integrates quadratic solvers to tackle the complex challenges posed by the (Fused) Gromov-Wasserstein problem. Utilizing flow matching as a backbone, our method offers a flexible and effective framework. We demonstrate its versatility and robustness through applications in cell development studies, cellular drug response modeling, and cross-modality cell translation, illustrating significant potential for enhancing therapeutic strategies.
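
The training signal described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes uniform mini-batch weights, a log-domain Sinkhorn loop for the entropic coupling, standard Gaussian noise, and a linear interpolation path for flow matching. The names `sinkhorn_coupling`, `sample_pairs`, and `flow_matching_batch` are hypothetical helpers introduced only for this example. A neural velocity field conditioned on the source sample would then be regressed onto `target_velocity`.

```python
import numpy as np


def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sq_euclidean(x, y):
    # Squared Euclidean cost matrix; any other cost function could be passed instead.
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)


def sinkhorn_coupling(x, y, cost_fn, eps=1.0, n_iters=200):
    # Entropic OT coupling between two uniformly weighted mini-batches.
    n, m = len(x), len(y)
    C = cost_fn(x, y)
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iters):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)


def sample_pairs(P, rng, n_pairs):
    # Draw (source index, target index) pairs with probability proportional to P.
    flat = rng.choice(P.size, size=n_pairs, p=(P / P.sum()).ravel())
    return np.unravel_index(flat, P.shape)


def flow_matching_batch(x, y, P, rng, n_pairs):
    # Build regression targets for a velocity field conditioned on the source sample.
    i, j = sample_pairs(P, rng, n_pairs)
    z = rng.standard_normal(y[j].shape)   # noise sample (assumed standard Gaussian)
    t = rng.uniform(size=(n_pairs, 1))    # random time in [0, 1)
    y_t = (1.0 - t) * z + t * y[j]        # point on the straight path from noise to target
    target_velocity = y[j] - z            # velocity of that straight path
    return x[i], t, y_t, target_velocity


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 2))          # toy source batch
y = rng.standard_normal((64, 2)) + 3.0    # toy target batch
P = sinkhorn_coupling(x, y, sq_euclidean, eps=1.0)
cond, t, y_t, target_velocity = flow_matching_batch(x, y, P, rng, n_pairs=64)
```

Because the noise is resampled for every pair, the same source sample can be pushed to different targets, which is what makes the learned map stochastic rather than deterministic.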

Paper Structure

This paper contains 54 sections, 4 theorems, 56 equations, 25 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 3.0

Let $\pi^\star_{\varepsilon, \tau}$ be an unbalanced EOT coupling, a solution of \ref{eq:unbalanced-entropic-kantorovich-problem} or \ref{eq:unbalanced-entropic-gromov-wasserstein-problem} between $\mu \in \mathcal{M}^+(\mathcal{X})$ and $\nu \in \mathcal{M}^+(\mathcal{Y})$. We note $\tilde{\mu} = p_1 \sharp \pi^

Figures (25)

  • Figure 1: Left: What do we do? One task we consider is generating RNA cell profiles from ATAC measurements and an additional cell feature. This is explained in § \ref{subsec:genot_gw}, and demonstrated in Fig. \ref{fig:joint_umap_both_spaces}. As the cells live on manifolds in two (partially) incomparable spaces, we rely on the Fused Gromov-Wasserstein (FGW) formulation, as described in § \ref{subsec:fused-genot}. Here, the incomparable structural information is contained in the ATAC and the RNA measurements, while the comparable information is given by the cell features. Right: How do we do it? For each $(\textcolor{red}{\mathbf{x}}, \textcolor{BurntOrange}{\mathbf{u}})$ in the support of the source $\textcolor{red}{\bm{\mu}}$, we learn a flow $\phi_1(\cdot|\textcolor{red}{\mathbf{x}}, \textcolor{BurntOrange}{\mathbf{u}})$ from the noise $\textcolor{ForestGreen}{\bm{\rho}}$ to the conditional $\pi^\star_\varepsilon(\cdot|\textcolor{red}{\mathbf{x}}, \textcolor{BurntOrange}{\mathbf{u}})$, whose support lies in that of the target $\bm{\textcolor{blue}{\nu}}$. The flow is multi-modal: it allows sampling structural information $\textcolor{blue}{\mathbf{y}}$ and features $\textcolor{BurntOrange}{\mathbf{v}}$ simultaneously. We highlight this procedure for a specific pair ($\mathbf{x}_1$, $\mathbf{u}_1$), with $\boldsymbol{p}=2$ and $\boldsymbol{q}=3$.
  • Figure 2: Left: Source cell from the early time points (top left) and samples of the conditional distributions of the EOT coupling learned with GENOT for the geodesic cost $d_\mathcal{M}$ (middle) and the $\ell_2^2$ cost (right) projected onto a UMAP \cite{mcinnes2018umap}, along with biological assessment of the learnt dynamics (TSI score, CT error; see § \ref{app:metrics_single_cell}, Fig. \ref{fig:pancreas_transitions}). Right: UMAP colored according to the uncertainty score $\mathrm{cos}\text{-}\mathrm{var}(\pi_\theta^{d_\mathcal{M}}(\cdot|X))$ of each source cell $\mathbf{x}$. Target cells are colored in gray.
  • Figure 3: Left: Accuracy of cellular response predictions of U-GENOT-L for cancer drugs with varying unbalancedness parameter $\tau=\tau_1=\tau_2$. Smaller $\tau$ implies more unbalancedness (3 runs per $\tau$). Right: Mapping a Swiss roll in $\mathbb{R}^3$ ($\mu$) to a spiral in $\mathbb{R}^2$ ($\nu$) with GENOT-Q. Center: Color code tracks where samples from $\mu$ (top) are mapped to (bottom). Right column: samples from $\mu$ (top) and the corresponding conditionals, along with conditional density estimates. The learned \ref{eq:entropic-gromov-wasserstein-problem} coupling minimizes the distortion: points that are close in the support of $\mu$ generate points that are close in the support of $\nu$.
  • Figure 4: Left: Benchmark of GENOT-Q models against discrete GW (GW-LR, App. \ref{app:competing_methods}) on translating cells between ATAC space of dim. $d_1$ and RNA space of dim. $d_2$, with performance measured with FOSCTTM score (App. \ref{app:metrics_single_cell}) and Sinkhorn divergence between target and predicted target distribution. (left) intra-domain costs $c_\mathcal{X}=c_\mathcal{Y}=\ell_2^2$, (right) geodesic distances $c_\mathcal{X} = d_\mathcal{X}$ and $c_\mathcal{Y} = d_\mathcal{Y}$. We show mean and std across 3 runs. Right: Top: UMAP of transported cells with GENOT-F (colored by cell type) and cells in the target distribution (gray). Cells of the same cell type generate cells which cluster together in RNA space. Bottom: UMAP of transported cells with a GENOT model trained on batch-wise independent couplings, thus not using OT, generating cells which are randomly mixed.
  • Figure 5: Fitting the EOT coupling between two Gaussian mixtures for the Coulomb cost \cite{benamou2015numerical} $c(\mathbf{x}, \mathbf{y}) = 1/\|\mathbf{x}-\mathbf{y}\|_2$, and $\varepsilon=0.01$, using GENOT and OT-CFM \cite{tong2023conditional}. For both methods, we hence use mini-batch couplings computed with this cost. We connect paired samples with a line. The EOT coupling pairs source samples and target samples diagonally (left). GENOT (middle) generates samples correctly, while OT-CFM (right) fails to preserve the signal from mini-batch couplings. The data setting is inspired by \cite{debortoli2023augmented}.
  • ...and 20 more figures
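
The FOSCTTM score mentioned in the Figure 4 caption is commonly described as the "fraction of samples closer than the true match". The sketch below follows that common description and is only an assumption about the details: the paper's own definition lives in its appendix on metrics and may normalize or average differently. It assumes that row `i` of `pred` and row `i` of `true` describe the same cell, uses Euclidean distance, and averages over both matching directions. The function name `foscttm` is chosen for this example.

```python
import numpy as np


def foscttm(pred, true):
    # pred, true: arrays of shape (n, d); row i of each is assumed to be the same cell.
    n = len(pred)
    d = np.linalg.norm(pred[:, None, :] - true[None, :, :], axis=-1)  # (n, n) distances
    match = np.diag(d)                                  # distance to the true match
    # For each predicted cell: fraction of target cells closer than its true match.
    frac_rows = (d < match[:, None]).sum(axis=1) / (n - 1)
    # For each target cell: fraction of predicted cells closer than its true match.
    frac_cols = (d < match[None, :]).sum(axis=0) / (n - 1)
    return 0.5 * (frac_rows.mean() + frac_cols.mean())


rng = np.random.default_rng(0)
true = rng.standard_normal((50, 5))                     # toy target embeddings
pred = true + 0.1 * rng.standard_normal((50, 5))        # toy predictions near the truth
score = foscttm(pred, true)                             # lower is better, 0 is perfect
```

Under this definition a score of 0 means every cell's true match is also its nearest neighbor in the other set, while values near 0.5 correspond to a matching that is no better than random.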

Theorems & Definitions (9)

  • Proposition 3.0: Re-Balancing the unbalanced problems.
  • Proposition 3.0: Pointwise estimation of re-weighting functions.
  • Proposition B.0: Re-Balancing the unbalanced problems.
  • Definition B.1
  • proof: Proof of \ref{prop:re-balance-eot-problems}
  • Remark B.2
  • Remark B.3
  • Proposition B.3: Pointwise estimation of re-weighting functions.
  • proof