Self-Normalizing Foundation Model for Enhanced Multi-Omics Data Analysis in Oncology

Asim Waqas, Aakash Tripathi, Sabeen Ahmed, Ashwin Mukund, Hamza Farooq, Matthew B. Schabath, Paul Stewart, Mia Naeini, Ghulam Rasool

TL;DR

SeNMo demonstrated robust performance on the classification task of predicting the primary cancer type of patients and significant performance in predicting tertiary lymph structures from multi-omics data, showing generalizability across cancer types, molecular data types, and clinical endpoints.

Abstract

Multi-omics research has enhanced our understanding of cancer heterogeneity and progression. Investigating molecular data through multi-omics approaches is crucial for unraveling the complex biological mechanisms underlying cancer, thereby enabling more effective diagnosis, treatment, and prevention strategies. However, predicting patient outcomes by integrating all available multi-omics data remains an understudied research direction. Here, we present SeNMo, a foundation model trained on multi-omics data across 33 cancer types. SeNMo is particularly efficient at handling multi-omics data characterized by high width (many features) and low length (few samples). We trained SeNMo to predict patients' overall survival using pan-cancer multi-omics data covering 33 cancer sites from the GDC. The training data include gene expression, DNA methylation, miRNA expression, DNA mutations, protein expression, and clinical data. SeNMo was validated on two independent cohorts: Moffitt Cancer Center and CPTAC lung squamous cell carcinoma. We evaluated the model's performance in predicting patients' overall survival using the C-Index. SeNMo performed consistently well in the training regime, reflected by a validation C-Index of 0.76 on GDC's public data. In the testing regime, SeNMo achieved a C-Index of 0.758 on a held-out test set. On the task of classifying the primary cancer type, the model showed an average accuracy of 99.8% on the pan-cancer test cohort. SeNMo further demonstrated significant performance in predicting tertiary lymph structures from multi-omics data, showing generalizability across cancer types, molecular data types, and clinical endpoints.
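
The abstract reports survival performance as a C-Index (concordance index). As a rough illustration of what that number measures, the following is a minimal pure-Python sketch of Harrell's C-index for right-censored data. It is not the authors' evaluation code: the function name, argument names, and the toy cohort are assumptions made for illustration, and pairs with tied follow-up times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only when the times differ and the earlier
        # time is an observed death (simplification: tied times are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier death was given the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: 4 of the 5 comparable pairs are ordered correctly.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.3, 0.5, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported values of 0.76 and 0.758.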

Paper Structure (27 sections, 6 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Overview of the SeNMo model. The data from public sources are collected using MINDS and curated to develop the multimodal dataset for SeNMo training. MINDS is a metadata framework for fusing publicly available data sources like TCGA-GDC and UCSC Xena Portal into a machine-learning-ready format. The dataset is preprocessed and fed to the self-normalizing deep learning encoder network that learns underlying sub-visual patterns from cross-modality, pan-cancer data. The learned encoder weights are later used for different downstream tasks (with or without fine-tuning), such as predicting the overall survival (OS), progression-free survival (PFS), cancer subtype classification, grading, or tertiary lymph structures (TLS) ratio.
  • Figure 2: Feature processing pipeline for pan-cancer data encompassing six data types: DNA Methylation, Gene Expression, miRNA Expression, Protein Expression, DNA Mutation, and Clinical features. Initial feature counts are reduced through preprocessing, with each modality unified at the pan-cancer level to yield unimodal pan-cancer feature sets. These unimodal features are further unified across all modalities, resulting in a final integrated multimodal feature matrix of 80,697 features, used for downstream analysis.
  • Figure 3: Summary of the survival data and number of cases in the pan-cancer data. The top panel illustrates patient survival in years, with each line representing an individual patient, marked as censored or dead, and grouped by cancer type. The bottom panel displays the number of cases for each primary cancer type, highlighting variation in sample sizes across cancers, which range from high counts in breast, kidney, and lung to lower counts in rare cancers like diffuse large B-cell lymphoma.
  • Figure 4: Architecture of the SeNMo encoder network. There are seven hidden layers, each comprising a linear unit, SELU activation, and alpha-dropout. The trained model has 83.33 million parameters. The numbers of neurons in each hidden layer, the input layer, and the output layer are also depicted in the figure. The same model is used for regression and classification tasks.
  • Figure 5: Study design and simulation structure for SeNMo across different learning regimes, datasets, and tasks. The baseline model (row 1) is first trained on TCGA data with 33 cancer types for overall survival (OS) prediction. Rows 2–5 represent variations: red-bordered boxes indicate a change from the baseline (e.g., out-of-distribution task and/or data), while green-bordered boxes align with the baseline. Simulations include OS prediction on both seen and unseen data (rows 2 and 3) and new tasks such as primary cancer type classification and TLS ratio prediction on seen and unseen datasets (rows 4 and 5).
  • ...and 6 more figures
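
The Figure 4 caption describes each hidden layer as a linear unit followed by SELU activation and alpha-dropout. The sketch below shows one such block in plain Python, using the standard SELU constants and the alpha-dropout affine correction from the self-normalizing networks formulation. It is only an illustration under stated assumptions: the helper names, the `training` flag, and any layer widths or dropout rates passed in are hypothetical, since the caption does not give the actual sizes, and this is not the authors' implementation.

```python
import math
import random

# Standard SELU constants from the self-normalizing networks formulation.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit applied to one value."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def alpha_dropout(values, p, rng):
    """Alpha-dropout with drop probability p.

    Dropped units are set to the SELU saturation value, then an affine
    correction (a, b) restores zero mean and unit variance for inputs
    that had zero mean and unit variance.
    """
    if p == 0:
        return list(values)
    q = 1.0 - p                                  # keep probability
    alpha_prime = -SELU_SCALE * SELU_ALPHA       # saturation value
    a = (q + alpha_prime ** 2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * alpha_prime
    return [a * (v if rng.random() < q else alpha_prime) + b for v in values]


def linear(values, weights, bias):
    """Dense layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def hidden_block(values, weights, bias, p, rng, training):
    """One hidden block: linear, then SELU, then alpha-dropout (training only)."""
    activated = [selu(z) for z in linear(values, weights, bias)]
    return alpha_dropout(activated, p, rng) if training else activated
```

Stacking seven such blocks between an input layer and an output layer would mirror the layer pattern in the caption; the appeal of SELU with alpha-dropout is that activations stay close to zero mean and unit variance without explicit normalization layers.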