scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

Jianle Sun, Chaoqi Liang, Ran Wei, Peng Zheng, Lei Bai, Wanli Ouyang, Hongliang Yan, Peng Ye

TL;DR

scMRDR addresses unpaired single-cell multi-omics integration by learning a unified latent space via a single encoder-decoder $\beta$-VAE that disentangles modality-shared $z_u$ and modality-specific $z_s^{(m)}$ components. It imposes isometric regularization to preserve intra-modality structure, adversarial alignment across omics, and a masked reconstruction loss to handle missing features, enabling scalability to more than two omics. Empirical results demonstrate strong batch correction, modality alignment, and biological signal preservation on two-omics and triple-omics benchmarks, and show practical utility in spatial-omics contexts through spatial-location imputation. Overall, scMRDR offers a flexible, scalable framework for large-scale multi-omics integration and downstream biological discovery.

Abstract

Advances in single-cell sequencing have enabled high-resolution profiling of diverse molecular modalities, yet integrating unpaired multi-omics single-cell data remains challenging. Existing approaches either rely on pairing information or prior correspondences, or require computing a global pairwise coupling matrix, which limits their scalability and flexibility. In this paper, we introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration. Specifically, we disentangle each cell's latent representation into modality-shared and modality-specific components using a $\beta$-VAE architecture, augmented with isometric regularization to preserve intra-omics biological heterogeneity, an adversarial objective to encourage cross-modal alignment, and a masked reconstruction loss to handle features that are missing across modalities. Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation. Crucially, it scales effectively to large datasets and supports integration of more than two omics, offering a powerful and flexible solution for large-scale multi-omics data integration and downstream biological discovery.
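
To make the idea of a masked reconstruction loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes a shared feature space in which each cell observes only the features of its own modality, and it uses mean squared error as a stand-in for whatever likelihood the paper actually uses. The function name `masked_mse` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_mse(x, x_hat, mask):
    """Mean squared error computed over observed entries only.

    x, x_hat: (cells, features) arrays of inputs and reconstructions.
    mask: same shape, 1 where the feature is measured for that cell's
          modality and 0 where it is missing (assumed convention).
    """
    mask = mask.astype(float)
    sq_err = (x - x_hat) ** 2 * mask
    # Normalize by the number of observed entries, guarding against zero.
    return sq_err.sum() / max(mask.sum(), 1.0)

# Toy example (hypothetical): 2 cells, 3 features in a shared feature space.
x = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
x_hat = np.array([[1.0, 1.0, 5.0],
                  [9.0, 1.0, 1.0]])
mask = np.array([[1, 1, 0],
                 [0, 1, 1]])

loss = masked_mse(x, x_hat, mask)
unmasked = ((x - x_hat) ** 2).mean()
```

In this toy case the large errors sit on unobserved entries, so the masked loss ignores them while the unmasked mean is dominated by them. That is the behavior one would want when a feature is simply absent from a modality rather than truly zero.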

Paper Structure

This paper contains 23 sections, 13 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Method overview. (a) Multi-omics data integration. The goal is to integrate single-cell data in different modalities into an aligned latent space while preserving biological information and correcting technical noise. (b) Integration via joint dimension reduction (e.g., joint autoencoders). It typically works with paired data (measurements on different omics within the same cell). (c) Integration via manifold alignment between the geometric structures (e.g., KNN distance graphs) of different omics. It does not require paired data, but is typically limited to small-scale datasets involving only two omics modalities. (d) Our framework, based on disentangled representations, is flexible to completely unpaired data and scalable to large datasets with more than two omics.
  • Figure 2: Overview of the proposed scMRDR. We employ $\beta$-VAE to disentangle omics-specific and omics-shared latent representations, and impose isometric loss and adversarial training as regularization to encourage modality integration and bio-conservation.
  • Figure 3: Graphical illustration of the single-cell multi-omics data generative model.
  • Figure 4: Performance comparisons on two-omics integration, where unscaled metrics calculated via scIB are reported.
  • Figure 5: Performance comparisons on two-omics integration with a large-scale dataset, where unscaled metrics calculated via scIB are reported. The default preprocessing method 'scglue.data.lsi' for GLUE fails to handle the large-scale data, and substituting it with PCA leads to severe performance degradation, although using 'TruncatedSVD' as an approximation of LSI can alleviate this issue.
  • ...and 9 more figures
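
The Figure 2 caption mentions an isometric loss used as regularization. The page does not give its exact form, so the following is only a generic sketch of one way a distance-preserving penalty could look: it compares pairwise Euclidean distances among cells in a reference space with the distances among the same cells in a latent space. The choice of reference space, the names `pairwise_dist` and `isometric_penalty`, and the toy points are all assumptions for illustration.

```python
import numpy as np

def pairwise_dist(a):
    """Euclidean distance matrix between the rows of a (n, d) array."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def isometric_penalty(ref, latent):
    """Mean squared gap between pairwise distances in two spaces.

    ref, latent: (cells, dims) arrays describing the same cells from one
    modality; the dimensionalities may differ (assumed setup).
    """
    d_ref = pairwise_dist(ref)
    d_lat = pairwise_dist(latent)
    return ((d_ref - d_lat) ** 2).mean()

# Toy example (hypothetical): two cells at distance 5 in the reference space.
ref = np.array([[0.0, 0.0], [3.0, 4.0]])
rotated = np.array([[0.0, 0.0], [-4.0, 3.0]])   # rotation keeps distances
stretched = np.array([[0.0, 0.0], [6.0, 8.0]])  # scaling by 2 distorts them

p_rot = isometric_penalty(ref, rotated)
p_stretch = isometric_penalty(ref, stretched)
```

A rotation leaves the penalty at zero while a stretch is penalized, which illustrates why a term of this kind constrains geometry within a modality without fixing the orientation of the latent space.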