Multimodal data integration and cross-modal querying via orchestrated approximate message passing

Sagnik Nandy, Zongming Ma

TL;DR

This paper tackles multimodal data integration for atlas construction in which multiple high-dimensional and low-dimensional views share subject-specific latent factors. It introduces OrchAMP, a data-driven, orchestrated AMP algorithm that jointly recovers latent factors across modalities and provides asymptotically valid prediction sets for query subjects with partial observations. The authors prove asymptotic normality and Bayes-optimality of the estimators under a fixed-point state-evolution framework, and establish prediction-set coverage guarantees via Glivenko–Cantelli arguments. Empirical validation on synthetic data and a tri-modal TEA-seq single-cell dataset demonstrates competitive signal recovery and informative, calibrated uncertainty quantification for atlas querying. Overall, OrchAMP offers a principled, scalable approach to cross-modal integration with theoretical guarantees and practical applicability to single-cell multi-omics.

Abstract

The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, these reference data sets are often queried later by new subjects that are only partially observed. Leveraging the asymptotic normality of the estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the effectiveness of the data integration and prediction set construction algorithms on both synthetic examples and real-world single-cell datasets.

Paper Structure (68 sections, 20 theorems, 284 equations, 8 figures, 6 tables, 2 algorithms)

Key Result

Proposition 5.1

For all $h \in [m]$, consider the matrices $\widebar{\bm X}_h$ and their best rank-$r_h$ approximations given by $\frac{1}{N}{\bm U}^{\mathrm{pc}}_{0,h}\bm D_{0,h} ({\bm V}^{\mathrm{pc}}_{0,h})^\top$. Let the matrices $\{\bm S^{L,\mathrm{pc}}_{0,h},\bm \Sigma^{L,\mathrm{pc}}_{0,h},\bm S^{R,\ldots}, \ldots\}$ [...] Here, $\{Y_{0,h}:h\in [m]\}$ and $\{\widetilde{Y}_{0,\ell}:\ell\in [\widetilde{m}] \}$ are defined [...]
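
The proposition starts from the best rank-$r_h$ approximation of each data matrix. As a minimal sketch of that single step (not of OrchAMP itself), the truncated SVD below computes such an approximation with NumPy. The function name `best_rank_r`, the synthetic data, and the dimensions are illustrative assumptions, and the paper's $\frac{1}{N}$ normalization is not reproduced here.

```python
import numpy as np


def best_rank_r(X, r):
    """Best rank-r approximation of X in Frobenius norm (Eckart-Young).

    Hypothetical helper for illustration; returns the approximation and
    the truncated SVD factors.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]
    # Scale the columns of U_r by the singular values, then multiply by V_r^T.
    X_r = (U_r * s_r) @ Vt_r
    return X_r, U_r, s_r, Vt_r


# Synthetic example (assumed sizes): a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
N, p, r = 200, 50, 2
signal = rng.normal(size=(N, r)) @ rng.normal(size=(r, p))
X = signal + 0.1 * rng.normal(size=(N, p))

X_r, U_r, s_r, Vt_r = best_rank_r(X, r)

# The Frobenius error of the truncation equals the norm of the discarded
# singular values.
err = np.linalg.norm(X - X_r)
tail = np.sqrt(np.sum(np.linalg.svd(X, compute_uv=False)[r:] ** 2))
```

Comparing `err` with `tail` is a quick sanity check that the truncation is the optimal one for the chosen rank.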

Figures (8)

  • Figure 1: UMAPs of PBMC atlases from $6323$ TEA-seq cells by Seurat WNN (left), OrchAMP (middle), and MOFA+ (right), colored according to original cell type annotations in 10.7554/eLife.63632.
  • Figure 2: Visualization of prediction sets constructed by Algorithm 2. Top left: OrchAMP cell atlas constructed on $6320$ TEA-seq cells. Cells are colored according to their cell type annotations in 10.7554/eLife.63632. Visualizations of the $95\%$ prediction set ($500$ randomly sampled points from the set in red and atlas cells in grey) when queried by the ATAC observation of a held-out double negative T cell (top right), its RNA observation (bottom left), and its Protein observation (bottom right), respectively.
  • Figure 3: Scree plots of the empirical singular values of the ATAC measurement matrix $\bm X_1$ (left) and the RNA measurement matrix $\bm X_2$ (right), after preprocessing.
  • Figure 4: Visualization of prediction sets constructed by Algorithm 2. Top left: OrchAMP cell atlas constructed on $6320$ TEA-seq cells. Cells are colored according to their cell type annotations in 10.7554/eLife.63632. Visualizations of the $95\%$ prediction set ($500$ randomly sampled points from the set in red and atlas cells in grey) when queried by the ATAC observation of a held-out CD8 effector cell (top right), its RNA observation (bottom left), and its Protein observation (bottom right), respectively.
  • Figure 5: Visualization of prediction sets constructed by Algorithm 2. Top left: OrchAMP cell atlas constructed on $6320$ TEA-seq cells. Cells are colored according to their cell type annotations in 10.7554/eLife.63632. Visualizations of the $95\%$ prediction set ($500$ randomly sampled points from the set in red and atlas cells in grey) when queried by the ATAC observation of a held-out pre-B cell (top right), its RNA observation (bottom left), and its Protein observation (bottom right), respectively.
  • ...and 3 more figures

Theorems & Definitions (27)

  • Remark 2.1
  • Remark 2.2
  • Proposition 5.1
  • Remark 5.1
  • Lemma 5.1
  • Lemma 5.2
  • Theorem 5.1
  • Corollary 5.1
  • Corollary 5.2
  • Lemma 5.3
  • ...and 17 more