Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets

Boris Landa, Yuval Kluger, Rong Ma

TL;DR

This work develops Entropic Optimal Transport (EOT) eigenmaps to align and jointly embed two high-dimensional datasets that share underlying structure but exhibit dataset-specific distortions. The method computes the EOT plan W between the datasets and uses its leading singular vectors to build a common embedding that preserves shared geometry while filtering batch effects. It can be read as an inter-data analog of classical Laplacian eigenmaps (t = 0) and of diffusion maps (t a positive integer). The authors provide theoretical guarantees under a latent manifold model with distortions: W concentrates around a population kernel evaluated at the latent variables, and the associated operators converge to a density-weighted Laplacian, which allows the shared manifold structure to be extracted in high dimensions. Empirically, EOT eigenmaps outperform existing methods in simulated alignment and clustering tasks and perform strongly in real single-cell data integration, including multi-omics and cross-modality analyses, with publicly available R/Python implementations. The result is a principled, interpretable framework, with theoretical support, for data integration and multi-view analysis under cross-study distortions in biology and beyond.

Abstract

Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, resulting in misaligned embeddings using traditional techniques. In this work, we propose Entropic Optimal Transport (EOT) eigenmaps, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align the datasets accordingly in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many favorable analogous properties. We then analyze a data-generative model where two observed high-dimensional datasets share latent variables on a common low-dimensional manifold, but each dataset is subject to data-specific translation, scaling, nuisance structures, and noise. We show that in a high-dimensional asymptotic regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables. Subsequently, we provide a geometric interpretation of our embedding by relating it to the eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.
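
As a rough illustration of the pipeline described in the abstract, the following Python sketch computes an EOT plan with plain Sinkhorn iterations and embeds both datasets using the singular vectors of that plan. The function names, the squared Euclidean cost, the uniform marginals, the regularization parameter `eps`, and the scaling of coordinates by normalized singular values raised to the power `t` are all assumptions made for illustration; this is not the authors' implementation, and the exact scaling in the paper may differ.

```python
import numpy as np


def sinkhorn_plan(C, eps, n_iter=500):
    """Entropic OT plan with uniform marginals for an (m, n) cost matrix C."""
    m, n = C.shape
    a = np.full(m, 1.0 / m)  # uniform row marginal (assumption)
    b = np.full(n, 1.0 / n)  # uniform column marginal (assumption)
    # Subtracting row minima only rescales the rows of K, which the scaling
    # vector u absorbs, so the plan is unchanged while underflow is reduced.
    K = np.exp(-(C - C.min(axis=1, keepdims=True)) / eps)
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def eot_eigenmaps(X, Y, eps, q=2, t=0, n_iter=500):
    """Hypothetical joint embedding of X (m, d) and Y (n, d) into R^q."""
    m, n = X.shape[0], Y.shape[0]
    # Pairwise squared Euclidean distances between the two datasets.
    C = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    W = sinkhorn_plan(np.maximum(C, 0.0), eps, n_iter)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # The first singular pair is constant (a consequence of the marginal
    # constraints), so the embedding uses singular vectors 2, ..., q + 1.
    scale = (s[1:q + 1] / s[0]) ** t  # assumed diffusion-style weighting
    X_emb = np.sqrt(m) * U[:, 1:q + 1] * scale
    Y_emb = np.sqrt(n) * Vt[1:q + 1, :].T * scale
    return X_emb, Y_emb, W
```

With a squared Euclidean cost, a translation of one dataset only multiplies the Gibbs kernel by row and column factors, which the Sinkhorn scaling vectors absorb. The resulting plan, and hence the embedding, is therefore unaffected by such a shift, consistent with the data-specific translation the abstract mentions.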

Paper Structure

This paper contains 29 sections, 9 theorems, 111 equations, 6 figures, 1 algorithm.

Key Result

Proposition 1

Under the embedding constraints, the cost function $J(\mathcal{X}^{'},\mathcal{Y}^{'})$ is minimized by the embedding $(\widetilde{\mathcal{X}},\widetilde{\mathcal{Y}})$ given by the embedding formula with $t=0$.

Figures (6)

  • Figure 1: Datasets $\mathcal{X}$ and $\mathcal{Y}$ in $\mathbb{R}^3$ (left) and their joint embedding $(\widetilde{\mathcal{X}},\widetilde{\mathcal{Y}})$ in $\mathbb{R}^2$ (right) using EOT eigenmaps. The dataset $\mathcal{X}$ contains $m=1,000$ samples from an annulus in the XY-plane, while $\mathcal{Y}$ contains $n=5,000$ samples from a shifted and scaled version of the same annulus with additional variation along the Z-axis. The transport plan $W\in\mathbb{R}^{m\times n}$ encodes the cross-data pairwise affinities between the points of $\mathcal{X}$ and $\mathcal{Y}$. Our proposed method embeds $\mathcal{X}$ ($\mathcal{Y}$) into $\mathbb{R}^2$ using the second and third left (right) singular vectors of $W$, up to a suitable scaling (see the embedding formula in the section on the proposed method, with $q=2$ and $t=0$). On the left, the points of $\mathcal{X}$ ($\mathcal{Y}$) are colored according to the second left (right) singular vector of $W$, highlighting the correspondence between the datasets. Our joint embedding captures the underlying structure shared between the datasets in the XY-plane and aligns the datasets accordingly.
  • Figure 2: Comparison of nine integration methods based on simulations. (a) Visualization of the first three coordinates of each integrated low-dimensional embedding under Setting 1 of noisy manifold alignment experiments with $\tau=8$, where the data points are colored according to datasets; (b) Closer comparison of the best four methods in (a) and (c) based on the first two coordinates of their integrated low-dimensional embeddings; (c) Numerical evaluation of different methods in terms of noisy manifold alignment across the two simulation settings; (d) Numerical evaluation of different methods in terms of joint clustering performance.
  • Figure 3: Integrative analyses of single-cell omics data. (a) Comparison of eight methods on three single-cell omics data integration tasks measured by two metrics, whose medians across a range of embedding dimensions $q$ are shown here, where a higher value indicates better performance. (b) UMAP visualization of the joint low-dimensional embedding of genes and accessible chromatin regions, colored according to feature modalities. We select four clusters of features (m1-m4), identified by DBSCAN and each consisting of a regulatory module, for closer examination. (c) Scatter plots of the average expression (x-axis) of the genes and the average level of accessibility or gene activity (y-axis) of the accessible regions contained in each module for all the cells, where the cells are colored according to their cell type annotations.
  • Figure S1: Comparison of Davies-Bouldin index of nine integration methods in various simulations. Left and Middle: simulations for noisy manifold alignment, setting 1 (left) and setting 2 (middle). Right: simulations for joint clustering. Our results indicate superior performance of the proposed methods ("EOT-0" and "EOT-1") in aligning the latent structures.
  • Figure S2: Comparison of mean Silhouette index (Top row) and neighbor purity score (Bottom row) of eight integration methods in three pairs of single-cell omics data, where each boxplot contains the metrics for each method across a range of embedding dimensions $q$ (from 2 to 20). Left: scRNA-seq data of human PBMCs under different experimental conditions. Middle: scATAC-seq data of mouse brain cells from different studies. Right: scRNA-seq data of human PBMCs from different samples. Our results indicate the advantages of the proposed method ("EOT") in identifying and aligning the different cell types across datasets, and its robustness with respect to the choice of embedding dimension $q$.
  • ...and 1 more figure
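
The synthetic setup in the Figure 1 caption can be approximated with a short Python sketch. The sample sizes $m=1,000$ and $n=5,000$ come from the caption; the annulus radii, the shift, the scale factor, the range of the Z-axis variation, and the function names are hypothetical values chosen only for illustration.

```python
import numpy as np


def sample_annulus(n, r_in, r_out, rng):
    """Sample n points uniformly (by area) from a planar annulus."""
    r = np.sqrt(rng.uniform(r_in**2, r_out**2, size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])


def make_figure1_like_data(m=1000, n=5000, seed=0):
    """Two datasets in R^3 sharing an annulus structure in the XY-plane."""
    rng = np.random.default_rng(seed)
    r_in, r_out = 1.0, 2.0             # assumed annulus radii
    shift = np.array([3.0, 3.0, 0.0])  # assumed translation of Y
    scale = 1.5                        # assumed scaling of Y
    z_range = 1.0                      # assumed extent of Z-axis variation
    # X lies exactly in the XY-plane.
    X = np.column_stack([sample_annulus(m, r_in, r_out, rng), np.zeros(m)])
    # Y is a scaled and shifted annulus with extra variation along Z.
    Y_xy = scale * sample_annulus(n, r_in, r_out, rng)
    Y_z = rng.uniform(-z_range, z_range, size=n)
    Y = np.column_stack([Y_xy, Y_z]) + shift
    return X, Y
```

The two arrays returned by `make_figure1_like_data()` have shapes `(1000, 3)` and `(5000, 3)` and can serve as toy inputs for any joint embedding method.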

Theorems & Definitions (12)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 4
  • Corollary 5
  • Corollary 6
  • Lemma 7: Boundedness of scaling factors [landa2022scaling]
  • Lemma 8: Stability of scaling factors under approximate scaling
  • proof
  • Lemma 9
  • ...and 2 more