
Scalable unsupervised alignment of general metric and non-metric structures

Sanketh Vedula, Valentino Maiorca, Lorenzo Basile, Francesco Locatello, Alex Bronstein

TL;DR

This work formulates the alignment of metric structures as a discrete Gromov-Wasserstein problem. Instead of solving the resulting quadratic assignment problem (QAP) directly, it learns a related, scalable linear assignment problem (LAP) whose solution is also a minimizer of the QAP.

Abstract

Aligning data from different domains is a fundamental problem in machine learning with broad applications across very different areas, most notably aligning experimental readouts in single-cell multiomics. Mathematically, this problem can be formulated as the minimization of disagreement of pairwise quantities such as distances and is related to the Gromov-Hausdorff and Gromov-Wasserstein distances. Computationally, it is a quadratic assignment problem (QAP) that is known to be NP-hard. Prior works attempted to solve the QAP directly with entropic or low-rank regularization on the permutation, an approach that is computationally tractable only for modestly sized inputs and encodes only limited inductive bias related to the domains being aligned. We consider the alignment of metric structures formulated as a discrete Gromov-Wasserstein problem and, instead of solving the QAP directly, we propose to learn a related, scalable linear assignment problem (LAP) whose solution is also a minimizer of the QAP. We also show a flexible extension of the proposed framework to general non-metric dissimilarities through differentiable ranks. We extensively evaluate our approach on synthetic and real datasets from single-cell multiomics and neural latent spaces, achieving state-of-the-art performance while being conceptually and computationally simple.
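
To make the QAP/LAP relationship concrete, the following is a minimal toy sketch, not the paper's method. It assumes the second domain is a rotated, shuffled copy of the first, and it builds the linear cost from an oracle rotation `R`, whereas the paper proposes to learn the LAP. The helper names `pairwise_dist` and `gw_distortion` are hypothetical, and the squared-difference form of the distortion is one common choice of disagreement measure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))


def gw_distortion(DX, DY, perm):
    """QAP objective: squared disagreement of pairwise distances when
    point i of X is matched to point perm[i] of Y."""
    return ((DX - DY[np.ix_(perm, perm)]) ** 2).sum()


rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))

# Toy second domain (assumption): an isometric, shuffled copy of X.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
true_perm = rng.permutation(N)                # X[i] corresponds to Y[true_perm[i]]
Y = np.empty_like(X)
Y[true_perm] = X @ R

DX, DY = pairwise_dist(X), pairwise_dist(Y)

# LAP with an oracle linear cost C[i, j] = ||X_i R - Y_j||^2.
# In the paper this cost would be learned; here R is known by construction.
C = (((X @ R)[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
_, lap_perm = linear_sum_assignment(C)

random_perm = rng.permutation(N)
print("GW distortion, LAP solution:   ", gw_distortion(DX, DY, lap_perm))
print("GW distortion, random matching:", gw_distortion(DX, DY, random_perm))
```

In this toy setting the LAP solution recovers the hidden correspondence and drives the quadratic distortion to (numerically) zero, while a random matching does not. Solving the LAP takes polynomial time, which is why replacing the QAP with a suitable LAP is attractive.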

Paper Structure (20 sections, 14 equations, 5 figures)

This paper contains 20 sections, 14 equations, 5 figures.

Figures (5)

  • Figure 1: The proposed solver generalizes to unseen samples and scales to large sample sizes post-training. In both top and bottom experiments, $\mathcal{X}$ and $\mathcal{Y}$ are ViT embeddings. The entropic GW solver can only operate in the transductive regime and runs out of memory for $N>25000$.
  • Figure 2: Simulated annealing of $\epsilon$ and spectral geometric regularization are effective in stabilizing the solver and improving the accuracy of the assignment. Left: simulated annealing schedule used. Middle: distribution of the alignment error (measured as FOSCTTM) over $20$ runs with and without $\epsilon$-annealing. Right: distribution of the alignment error with and without the spectral geometric regularization of the transport cost.
  • Figure 3: Qualitative and quantitative results on the human bone marrow single-cell dataset. Top plots depict the UMAP of the translated cells colored by domain (left) and by cell type (right). Bottom plots report the FOSCTTM metrics for $\mathcal{Y}$ projected onto $\mathcal{X}$ (left) and $\mathcal{X}$ projected onto $\mathcal{Y}$ (right).
  • Figure 4: Qualitative evaluation of the proposed GW solver in the inductive setting. The plot depicts the assignment produced by our distance-based GW solver (Eq. \ref{eq:final_objective_ours}) on a new set of samples.
  • Figure 5: Qualitative and quantitative results on the scSNARE-seq dataset. Left and middle: aligned samples from ATAC and RNA, colored by domain (ATAC: black, RNA: red) and by cell type, respectively. Right: the sorted FOSCTTM plot, a quantitative metric measuring the quality of the assignment.
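
The captions above report alignment error as FOSCTTM without defining it. The sketch below assumes the commonly used definition (fraction of samples closer than the true match) and assumes that the two domains are already projected into a shared space with row `i` of `X` paired with row `i` of `Y`. The function name `foscttm` and the tiny inputs are illustrative only.

```python
import numpy as np


def foscttm(X, Y):
    """For each X[i], the fraction of samples in Y that lie strictly closer
    to X[i] than its true match Y[i]. Lower is better; 0 is a perfect match.
    Assumes rows of X and Y are paired and live in the same space."""
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    true_match = np.diag(dist)[:, None]
    return (dist < true_match).sum(axis=1) / (len(Y) - 1)


X = np.array([[0.0], [1.0], [2.0]])
perfect = foscttm(X, X.copy())    # every point sits on its true match
flipped = foscttm(X, X[::-1])     # the two end points are swapped

print(perfect)           # [0. 0. 0.]
print(np.sort(flipped))  # [0. 1. 1.]
```

Sorting the per-sample values, as in the last line, gives the kind of curve shown in a sorted FOSCTTM plot. Evaluating both `foscttm(X, Y)` and `foscttm(Y, X)` covers the two projection directions.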