MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification

Tiantian Yang, Zhiqian Chen

TL;DR

MOTGNN tackles the challenge of small-sample, high-dimensional, multi-omics disease prediction by generating modality-specific, supervised graphs via XGBoost trees and learning embeddings with GEDFN-based GNNs on each graph. A deep feedforward network then fuses these embeddings for binary classification, while providing end-to-end interpretability through feature- and omics-level importance scores. Across TCGA cancer datasets, MOTGNN consistently outperforms baselines and maintains robustness under class imbalance, with insights into which modalities and biomarkers drive predictions. The framework offers a scalable, interpretable approach for integrating heterogeneous omics data to enhance disease inference and biomarker discovery.

Abstract

Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality of multi-omics data, the heterogeneity across modalities, and the lack of reliable biological interaction networks make meaningful integration challenging. In addition, many existing models rely on handcrafted similarity graphs, are vulnerable to class imbalance, and often lack built-in interpretability, limiting their usefulness in biomedical applications. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) for omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. Across three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance. The model maintains computational efficiency through the use of sparse graphs and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight the potential of MOTGNN to improve both predictive accuracy and interpretability in multi-omics disease modeling.
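
The abstract does not spell out how a boosted-tree ensemble becomes a feature graph, so the following is only a minimal sketch of one plausible reading: features that appear as parent and child split nodes in the same tree are joined by an undirected edge, and features never used in a split are dropped. The record layout (tree index, node id, feature, yes-child id, no-child id, with leaves marked "Leaf") is loosely modeled on the table returned by XGBoost's `Booster.trees_to_dataframe()`; the toy records, the feature names, and the helper `build_feature_graph` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical split records: (tree_index, node_id, feature, yes_child, no_child).
# Leaves carry the feature name "Leaf" and have no children.
TOY_NODES = [
    (0, "0-0", "gene_A", "0-1", "0-2"),
    (0, "0-1", "gene_B", "0-3", "0-4"),
    (0, "0-2", "Leaf", None, None),
    (0, "0-3", "Leaf", None, None),
    (0, "0-4", "Leaf", None, None),
    (1, "1-0", "gene_B", "1-1", "1-2"),
    (1, "1-1", "gene_C", "1-3", "1-4"),
    (1, "1-2", "gene_A", "1-5", "1-6"),
    (1, "1-3", "Leaf", None, None),
    (1, "1-4", "Leaf", None, None),
    (1, "1-5", "Leaf", None, None),
    (1, "1-6", "Leaf", None, None),
]


def build_feature_graph(nodes):
    """Return an undirected adjacency map over the features used in splits.

    Assumption (not confirmed by the source): an edge links two distinct
    features whenever one splits directly below the other in some tree.
    """
    feature_of = {node_id: feat for _, node_id, feat, _, _ in nodes}
    adjacency = defaultdict(set)
    for _, _, feat, yes_id, no_id in nodes:
        if feat == "Leaf":
            continue
        adjacency[feat]  # register the feature even if it ends up isolated
        for child_id in (yes_id, no_id):
            child_feat = feature_of.get(child_id, "Leaf")
            if child_feat != "Leaf" and child_feat != feat:
                adjacency[feat].add(child_feat)
                adjacency[child_feat].add(feat)
    return {feat: sorted(nbrs) for feat, nbrs in adjacency.items()}


graph = build_feature_graph(TOY_NODES)
for feature, neighbours in graph.items():
    print(feature, "->", neighbours)
```

Because only split features become nodes, a construction of this kind performs feature selection and graph building in one pass, and the resulting graph stays sparse when the trees are shallow.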

Paper Structure

This paper contains 12 sections, 9 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Overview of the proposed MOTGNN framework for multi-omics data integration and disease classification. The model comprises three key modules: (i) XGBoost for constructing omics-specific supervised feature graphs; (ii) graph neural network (GNN) for learning modality-specific embeddings by encoding each graph and its corresponding data matrix; and (iii) deep feedforward network (DFN) for integrating the learned embeddings and performing final classification.
  • Figure 2: Distribution of DNA methylation, mRNA, and miRNA features across the preprocessed COADREAD, LGG, and STAD datasets. Each omics dataset was independently scaled to the range of [0, 1] using min-max normalization. The distinct distributional patterns highlight the heterogeneous characteristics of different omics types.
  • Figure 3: Feature dimensions before and after XGBoost-based selection on COADREAD, LGG, and STAD datasets. The bar plots compare preprocessed feature dimensions (pre-selection) with reduced dimensions (post-selection), showing substantial dimensionality reduction across all omics types.
  • Figure 4: Comparison of classification performance across COADREAD, LGG, and STAD datasets. Bars show mean scores, and error bars represent 95% confidence intervals over 20 independent runs. MOTGNN consistently achieves the highest scores across accuracy, ROC-AUC, and F1 metrics.
  • Figure 5: F1-score comparison across 20 independent runs on the imbalanced COADREAD dataset (class ratio 254:78). (A) Box plots display the median (labeled), interquartile range, and outliers. (B) Violin plots illustrate the distribution of F1-scores with labeled medians. MOTGNN achieves the highest median score with the lowest variability, demonstrating stable and robust performance under class imbalance.
  • ...and 1 more figure
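
The Figure 2 caption states that each omics dataset was independently scaled to [0, 1] with min-max normalization. A minimal sketch of that step follows, assuming the scaling is applied per feature (column-wise) to a samples-by-features matrix; the caption does not say whether scaling is per feature or over the whole matrix, so this is one interpretation. The helper name `min_max_scale`, the toy values, and the choice to map constant columns to 0.0 are illustrative assumptions.

```python
def min_max_scale(matrix):
    """Scale each column of a samples-by-features matrix to the range [0, 1].

    Assumption: scaling is column-wise. Constant columns are mapped to 0.0
    to avoid division by zero; the source does not say how they are handled.
    """
    scaled_columns = []
    for column in zip(*matrix):  # iterate over features
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back to samples-by-features.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical toy matrix: 3 samples, 2 features.
toy_mirna = [
    [2.0, 10.0],
    [4.0, 10.0],
    [6.0, 30.0],
]
print(min_max_scale(toy_mirna))
```

Each omics matrix (DNA methylation, mRNA, miRNA) would be passed through such a function separately, so that the range of one modality does not influence another.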