Integrate Any Omics: Towards genome-wide data integration for patient stratification

Shihao Ma, Andy G. X. Zeng, Benjamin Haibe-Kains, Anna Goldenberg, John E. Dick, Bo Wang

TL;DR

IntegrAO tackles the critical problem of integrating incomplete multi-omics data for cancer patient stratification by constructing and fusing partially overlapping patient graphs across modalities, then learning unified embeddings with omics-specific Graph Neural Networks. The framework supports both transductive integration and inductive prediction, enabling robust classification of new patients with partial omics data. Across simulated data, AML case studies, and pan-cancer benchmarks, IntegrAO demonstrates superior robustness to missing data, refined subtyping with biological and clinical relevance, and reliable subtype prediction for unseen patients. This modality-agnostic approach has significant implications for precision oncology, offering a scalable, data-efficient path to holistic patient characterization and decision support.

Abstract

Advances in high-throughput omics profiling have greatly enhanced cancer patient stratification. However, incomplete data presents a significant challenge in multi-omics integration, as traditional remedies such as sample exclusion or imputation often compromise biological diversity and dependencies. Furthermore, the critical task of accurately classifying new patients with partial omics data into existing subtypes is commonly overlooked. To address these issues, we introduce IntegrAO (Integrate Any Omics), an unsupervised framework for integrating incomplete multi-omics data and classifying new samples. IntegrAO first combines partially overlapping patient graphs from diverse omics sources and then uses graph neural networks to produce unified patient embeddings. Our systematic evaluation across five cancer cohorts involving six omics modalities demonstrates IntegrAO's robustness to missing data and its accuracy in classifying new samples with partial profiles. An acute myeloid leukemia case study further validates its capability to uncover biological and clinical heterogeneity in incomplete datasets. IntegrAO's ability to handle heterogeneous and incomplete data makes it an essential tool for precision oncology, offering a holistic approach to patient characterization.
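
To make "partially overlapping patient graphs" concrete, here is a minimal sketch, not the authors' implementation. It assumes a plain k-nearest-neighbour graph with Euclidean distance in place of whatever similarity measure the paper uses. The function name `build_patient_graphs`, the patient IDs, and the toy feature values are all hypothetical. Each modality gets its own graph over only the patients profiled in that modality, so the node sets overlap partially.

```python
import math


def build_patient_graphs(omics, k=1):
    """Build one k-nearest-neighbour patient graph per omics modality.

    omics: dict mapping modality name -> {patient_id: feature list}.
    Patients may be missing from any modality (assumed input layout).
    Returns: dict mapping modality name -> {patient_id: [neighbour ids]}.
    """
    graphs = {}
    for modality, profiles in omics.items():
        graph = {}
        for pid, feats in profiles.items():
            # Distances to every other patient profiled in this modality.
            dists = sorted(
                (math.dist(feats, other), oid)
                for oid, other in profiles.items()
                if oid != pid
            )
            graph[pid] = [oid for _, oid in dists[:k]]
        graphs[modality] = graph
    return graphs


# Toy cohort: P1 lacks methylation, P4 lacks mRNA (hypothetical values).
omics = {
    "mRNA": {"P1": [0.0, 0.0], "P2": [0.1, 0.0], "P3": [5.0, 5.0]},
    "methylation": {"P2": [1.0], "P3": [1.2], "P4": [9.0]},
}
graphs = build_patient_graphs(omics, k=1)

# Patients present in both graphs; these are the nodes a fusion step
# could use to pass information between modalities.
shared = set(graphs["mRNA"]) & set(graphs["methylation"])
```

In this toy cohort only P2 and P3 appear in both graphs, while P1 and P4 each appear in a single graph and are still retained rather than excluded.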

Paper Structure (27 sections, 21 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Overview of the IntegrAO framework. (a) Step 1: Example datasets (cell composition, mRNA expression, microRNA expression, DNA methylation and copy number variation) are used to construct per-omics patient graphs. Patient data need not encompass all omics types. Subsequently, a fusion phase iteratively refines each graph with information gathered from the other graphs, culminating in a unified graph for each type of omics. Step 2: Both these unified graphs and their corresponding omics features are input into omics-specific Graph Neural Networks (GNNs) to learn patient embeddings. These low-dimensional patient embeddings are optimized to retain similarity information from the individual unified graphs while minimizing differences in embeddings for the same patients across different omics. Step 3: The final embeddings are obtained by averaging the omics-specific embeddings and are used to construct the final integrated patient graph. (b) Conversion of IntegrAO into a predictive framework. Using the integrated graph, patient subtypes can be identified and leveraged to fine-tune the trained IntegrAO model. The fine-tuned IntegrAO model enables the classification of new patients with any accessible omics data. During inference, graph fusion is first conducted on new patients along with existing patients. The resulting fused graph and associated omics features are then input into the fine-tuned IntegrAO model, allowing for the prediction of patient subtypes.
  • Figure 2: Benchmarking of partial multi-omics integration by IntegrAO, NEMO, and MSNE on a simulated multi-omics cancer dataset using Normalized Mutual Information (NMI). (a) NMI versus overlapping data ratio across three missing-data scenarios (n=10 experiments for each ratio). Means of the evaluation metric across experiments are shown, with error bars representing plus/minus one standard deviation. From left to right: Uniform random subsampling of DNA methylation and protein expression with intact mRNA expression; Uniform random subsampling of mRNA expression and DNA methylation with intact protein expression; Uniform random subsampling of mRNA expression and protein expression with intact DNA methylation. IntegrAO demonstrates superior performance in all scenarios. (b) IntegrAO outperforms other methods in a more challenging scenario where all omics data are partially missing. (c) An illustrative example with a 70% data overlap ratio, showing 350 common and 50 unique samples per modality. (d) Pre-integration UMAP visualizations for each modality for the 70% all-missing data scenario, highlighting both common and unique samples. (e) Post-integration UMAP visualization of patient embeddings via IntegrAO. Upon integration, clustering resolution was enhanced, with unique samples from each network showing improved alignment.
  • Figure 3: Multi-omics integrative analysis of acute myeloid leukemia (AML) elucidating intertumor heterogeneity. (a) IntegrAO discerns 12 subtypes with distinct hierarchical composition, transcriptomic profiles, and mutational patterns, preserving granular differentiations. (b) IntegrAO subtypes demonstrate greater differential survival than subtypes from individual datasets. (c) IntegrAO reveals more significantly sensitive drugs than single data types. (d) Hematopoietic lineage enrichment analysis validates subtype differentiation, underscoring captured heterogeneity.
  • Figure 4: Comparative analysis of IntegrAO, NEMO, and MSNE across 5 cancer types with partial multi-omics data. The x-axis depicts differential survival between clusters, quantified by -log10 of the P-value from age-adjusted nested log-rank testing (higher indicates greater survival differentiation). The y-axis shows the number of enriched clinical parameters within clusters (higher denotes more parameters enriched). Each plot compares methods for a cancer dataset for different cluster numbers. Overall, IntegrAO more reliably identifies clusters with both better survival differentiation and higher clinical enrichment than other methods.
  • Figure 5: Performance comparison of new patient classification using IntegrAO versus MLP, SVM, XGBoost, Random Forest, and KNN under different omic combinations. Accuracy, F1-macro, and F1-weighted were evaluated, with means and standard deviations from multiple experiments displayed (error bars denote ±1 standard deviation). mRNA, meth, and miRNA refer to single-omic classification using mRNA expression, DNA methylation, and miRNA expression data respectively. miRNA+meth, miRNA+mRNA, and meth+mRNA indicate classification with two omics, while "all" used all three data types. Across all metrics and inputs, IntegrAO substantially outperforms other methods, highlighting its ability to effectively leverage diverse omics for integrative patient classification.
  • ...and 7 more figures
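
Step 3 in the Figure 1 caption (averaging omics-specific embeddings) can be sketched as follows. This is an illustrative sketch rather than the authors' code. It assumes that a patient missing some modalities is averaged only over the modalities in which that patient was profiled. The function name `average_embeddings` and the toy embeddings are hypothetical.

```python
def average_embeddings(per_omics):
    """Average omics-specific embeddings into one vector per patient.

    per_omics: dict mapping modality name -> {patient_id: embedding list},
    with all embeddings sharing one dimension (assumed input layout).
    A patient absent from a modality simply contributes no term for it.
    """
    sums, counts = {}, {}
    for embeddings in per_omics.values():
        for pid, vec in embeddings.items():
            if pid not in sums:
                sums[pid] = [0.0] * len(vec)
                counts[pid] = 0
            sums[pid] = [s + v for s, v in zip(sums[pid], vec)]
            counts[pid] += 1
    # Divide by the number of modalities actually available per patient.
    return {pid: [s / counts[pid] for s in total] for pid, total in sums.items()}


# Toy embeddings: P1 has only mRNA, P3 has only miRNA, P2 has both.
per_omics = {
    "mRNA": {"P1": [1.0, 0.0], "P2": [0.0, 2.0]},
    "miRNA": {"P2": [2.0, 4.0], "P3": [3.0, 3.0]},
}
final = average_embeddings(per_omics)
```

Here P2, profiled in both modalities, receives the mean of its two vectors, while P1 and P3 keep their single available embedding, so every patient ends up with a vector in the same space regardless of which omics were measured.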