Multimodal data integration and cross-modal querying via orchestrated approximate message passing
Sagnik Nandy, Zongming Ma
TL;DR
This paper tackles multimodal data integration for atlas construction in which multiple high-dimensional and low-dimensional views share subject-specific latent factors. It introduces OrchAMP, a data-driven, orchestrated AMP algorithm that jointly recovers latent factors across modalities and provides asymptotically valid prediction sets for query subjects with partial observations. The authors prove asymptotic normality and Bayes-optimality of the estimators under a fixed-point state-evolution framework, and establish prediction-set coverage guarantees via Glivenko–Cantelli arguments. Empirical validation on synthetic data and a tri-modal TEA-seq single-cell dataset demonstrates competitive signal recovery and informative, calibrated uncertainty quantification for atlas querying. Overall, OrchAMP offers a principled, scalable approach to cross-modal integration with theoretical guarantees and practical applicability to single-cell multi-omics.
Abstract
The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, the resulting reference data sets are often queried later by new subjects that are only partially observed. Leveraging the asymptotic normality of the estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the effectiveness of both the data integration and the prediction set construction algorithms on synthetic examples and real-world single-cell datasets.
