Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations

Tianjian Yang, Wei Vivian Li

TL;DR

GPCCA integrates multi-modal data with full or partial observations by extending probabilistic canonical correlation analysis to $R$ modalities and learning a shared $d$-dimensional latent embedding $Z$. Model parameters are estimated with an EM algorithm that accommodates data missing at random (MAR), high-dimensional covariance estimates are stabilized through ridge regularization, and $d$ is selected with a consensus-driven approach. The method is validated on synthetic three-modality data, a four-modality handwritten digits dataset, and TCGA multi-omics data, where it shows robust clustering and usefulness for survival prediction, and it is released as an open-source R package. Overall, GPCCA provides a practical, scalable tool for integrative analyses in bioinformatics and multi-view imaging, capable of imputing missing data and uncovering informative cross-modal patterns.
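To make the shared-latent idea concrete, the following Python sketch simulates several modalities from a linear-Gaussian model $x^{(r)} = W_r z + \mu_r + \varepsilon_r$ and computes the posterior mean of $z$ from only the observed entries of one sample. The parameterization, the isotropic noise, and the names `simulate_pcca` and `posterior_mean` are assumptions made for illustration. This is not the GPCCA package's implementation, and the EM updates and ridge regularization are not shown.

```python
import numpy as np


def simulate_pcca(n, dims, d, noise=0.5, seed=0):
    """Draw n samples from a toy shared-latent model (illustrative assumption).

    Each modality r is generated as x_r = W_r z + mu_r + eps_r with z ~ N(0, I_d)
    and isotropic Gaussian noise of standard deviation `noise`.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))                      # shared latent embedding
    W = [rng.standard_normal((m, d)) for m in dims]      # per-modality loadings
    mu = [rng.standard_normal(m) for m in dims]          # per-modality means
    X = [Z @ Wr.T + mur + noise * rng.standard_normal((n, m))
         for Wr, mur, m in zip(W, mu, dims)]
    return Z, W, mu, X


def posterior_mean(x, W, mu, Psi):
    """E[z | observed entries of x] for one sample; NaN marks a missing entry.

    W, mu, and Psi are the stacked loadings, stacked means, and noise covariance
    over all modalities. Missing entries are marginalized out, so only the
    observed rows of W and the observed block of Psi are used.
    """
    obs = ~np.isnan(x)
    W_o = W[obs]
    Psi_o = Psi[np.ix_(obs, obs)]
    A = np.linalg.solve(Psi_o, W_o)                      # Psi_o^{-1} W_o
    precision = np.eye(W.shape[1]) + W_o.T @ A           # posterior precision of z
    return np.linalg.solve(precision, A.T @ (x[obs] - mu[obs]))
```

A sample with an entire modality missing can be passed in directly: stack the modalities into one vector, set the missing block to `np.nan`, and call `posterior_mean` with the stacked parameters.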

Abstract

Background: The integration and analysis of multi-modal data are increasingly essential across various domains, including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when observations are only partially available.

Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model is robust to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset.

Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.
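The missing data patterns mentioned in the abstract can be illustrated with a short Python sketch that masks entries in two ways: element-wise at random within a modality, and modality-wise, where a sample loses an entire modality. The function names, the use of NaN as the missing marker, and the rule that every sample keeps at least one modality are assumptions for illustration. The mechanisms used in the paper's simulations (including its MNAR setting) are defined in its Methods and are not reproduced here.

```python
import numpy as np


def mask_entries_mcar(X, rate, rng):
    """Set each entry of X to NaN independently with probability `rate`."""
    X = X.astype(float)                     # astype returns a copy
    X[rng.random(X.shape) < rate] = np.nan
    return X


def mask_modalities(X_list, p, rng):
    """Drop whole modalities per sample, each with probability p.

    Assumption for this sketch: a sample that would lose every modality
    keeps one, chosen uniformly at random, so no sample is fully missing.
    """
    n, R = X_list[0].shape[0], len(X_list)
    drop = rng.random((n, R)) < p
    fully_dropped = np.flatnonzero(drop.all(axis=1))
    keep = rng.integers(R, size=fully_dropped.size)
    drop[fully_dropped, keep] = False
    out = []
    for r, X in enumerate(X_list):
        X = X.astype(float)
        X[drop[:, r], :] = np.nan           # blank out the whole row of modality r
        out.append(X)
    return out
```

Because the masks are drawn independently of the data values, both functions produce missingness that does not depend on what is missing.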

Paper Structure

This paper contains 14 sections, 14 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An illustration of the GPCCA model applied to a three-modality dataset. White boxes indicate missing data. In this example, Modality 2 is fully observed, while Modality 1 has randomly missing values, and Modality 3 exhibits modality-wise missingness.
  • Figure 2: Comparison of clustering performance (ARI) in the simulation study. a. Case A: normal data (MCAR). b. Case B: heavy-tailed data (MCAR). c. Case C: normal data (MNAR). d. Case D: correlated modalities (MCAR). GPCCA models with different regularization parameters are denoted as GPCCA-2/3 and GPCCA-1/2. PPCA applied to the concatenated modalities is denoted as PPCA-123. PPCA applied to individual modalities is denoted as PPCA-1, PPCA-2, and PPCA-3, respectively. Methods are ordered from high to low by their average ARI across all scenarios with different missing levels and correlation levels. In Cases A, B, and D, the horizontal labels at the bottom of the heatmap represent missing rates; in Case C, the horizontal labels represent the baseline probability ($p$) of modality-wise missingness (see Methods).
  • Figure 3: UMAP projections based on PPCA applied to each individual modality. a. Samples are colored by the inferred clusters. b. Samples are colored by the true group labels. The parameter settings shown are as follows: Cases A, B, and D ($20\%$ missing rate and $\rho = 0.7$); Case C (modality missingness with $p = 0.1$ and $\rho = 0.7$).
  • Figure 4: UMAP projections based on multi-modality analysis by PPCA, MOFA, and GPCCA. a. Samples are colored by the inferred clusters. b. Samples are colored by the true group labels. The parameter settings shown are as follows: Cases A, B, and D ($20\%$ missing rate and $\rho = 0.7$); Case C (modality missingness with $p = 0.1$ and $\rho = 0.7$). For GPCCA results, the model with the better ARI score is used for visualization: $\lambda = 2/3$ for Cases A, B, and D and $\lambda = 1/2$ for Case C.
  • Figure 5: UMAP projections of PPCA results based on the four individual modalities and their concatenated data. a. Fourier coefficients. b. Profile correlations. c. Karhunen-Loève coefficients. d. Zernike moments. e. Concatenated data of the four modalities. Samples are colored by the true class labels.
  • ...and 2 more figures
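
Figure 2 reports clustering performance as the adjusted Rand index (ARI). For reference, the standard ARI formula can be sketched in plain Python as below. The function name is hypothetical and this is not code from the GPCCA package. The score is 1 for labelings that agree up to a relabeling of clusters, and it is close to 0, or negative, when agreement is at or below chance level.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0                                   # fewer than two samples

    # Pair counts within contingency cells, true groups, and predicted clusters
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs     # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0                                   # degenerate case, e.g. one cluster in both
    return (sum_cells - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0 because the two labelings define the same partition.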