Heterogeneous graph attention network improves cancer multiomics integration

Sina Tabakhi, Charlotte Vandermeulen, Ian Sudbery, Haiping Lu

TL;DR

HeteroGATomics addresses the challenge of integrating high-dimensional multiomics data from small cohorts by combining joint feature selection with heterogeneous graph learning. The framework jointly selects informative features across omics via a multi-agent system and then learns modality-specific heterogeneous graphs with relation-aware GATs, followed by a late fusion to predict cancer types. It achieves superior diagnostic performance across BLCA, LGG, and RCC datasets and provides interpretable biomarker identification, including known cancer genes and novel targets. This dual-view, attention-based approach enhances expressivity, interpretability, and potential therapeutic insights in cancer multiomics analysis.

Abstract

The increase in high-dimensional multiomics data demands advanced integration models to capture the complexity of human diseases. Graph-based deep learning integration models, despite their promise, struggle with small patient cohorts and high-dimensional features, often applying independent feature selection without modeling relationships among omics. Furthermore, conventional graph-based omics models focus on homogeneous graphs, lacking multiple types of nodes and edges to capture diverse structures. We introduce a Heterogeneous Graph ATtention network for omics integration (HeteroGATomics) to improve cancer diagnosis. HeteroGATomics performs joint feature selection through a multi-agent system, creating dedicated networks of feature and patient similarity for each omic modality. These networks are then combined into one heterogeneous graph for learning holistic omic-specific representations and integrating predictions across modalities. Experiments on three cancer multiomics datasets demonstrate HeteroGATomics' superior performance in cancer diagnosis. Moreover, HeteroGATomics enhances interpretability by identifying important biomarkers contributing to the diagnosis outcomes.
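
As a rough illustration of the graph construction described above, the following minimal sketch builds one heterogeneous graph for a single omic modality from a patients-by-features matrix. It is not the authors' implementation. The similarity measures (Pearson correlation for features, cosine similarity for patients), the thresholds, the relation names, and the choice to link every patient to every feature with the measured value as the edge weight are all assumptions made for illustration. The function name `build_hetero_graph` is hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_hetero_graph(X, feat_thresh=0.8, pat_thresh=0.9):
    """Build a toy heterogeneous graph for one omic modality.

    X is a list of patient rows, each a list of feature values.
    Both thresholds are illustrative assumptions, not values from the paper.
    Returns a dict mapping (source type, relation, target type) to weighted edges.
    """
    n_patients, n_features = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]

    # Feature similarity network: nodes are features, edges are strong correlations.
    feature_edges = []
    for i, j in combinations(range(n_features), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= feat_thresh:
            feature_edges.append((i, j, r))

    # Patient similarity network: nodes are patients, edges link similar profiles.
    patient_edges = []
    for p, q in combinations(range(n_patients), 2):
        s = cosine(X[p], X[q])
        if s >= pat_thresh:
            patient_edges.append((p, q, s))

    # Feature-patient relation (assumed): each feature is linked to each patient,
    # weighted by that patient's measured value for the feature.
    feature_patient_edges = [
        (j, p, X[p][j]) for p in range(n_patients) for j in range(n_features)
    ]

    return {
        ("feature", "correlates", "feature"): feature_edges,
        ("patient", "similar", "patient"): patient_edges,
        ("feature", "measured_in", "patient"): feature_patient_edges,
    }


# Toy data: 4 patients, 3 features; features 0 and 1 are nearly collinear.
X = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
graph = build_hetero_graph(X)
for relation, edges in graph.items():
    print(relation, len(edges))
```

The three edge lists correspond to the three relation types that a relation-aware encoder would process separately. In practice, one such graph would be built per omic modality after feature selection, and a graph library's heterogeneous data structure would replace the plain dictionary.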

Paper Structure (9 sections, 11 equations, 13 figures, 6 tables, 2 algorithms)

Figures (13)

  • Figure 1: HeteroGATomics architecture. a, HeteroGATomics integrates joint feature selection and heterogeneous graph learning in six steps. (1) HeteroGATomics represents the preprocessed omics as feature similarity networks, where each network represents a specific omic with nodes corresponding to features and edges denoting their correlations. All omic modalities are interconnected at the raw feature level to capture cross-modality interactions. (2) A multi-agent system (MAS) performs joint feature selection on these networks to select informative features, considering both intra- and cross-modality interactions. (3) HeteroGATomics builds a patient similarity network for each omic and combines it with the feature similarity network to construct a heterogeneous graph. (4) GAT encoders learn the representations of each individual heterogeneous graph. (5) A single-layer neural network predicts patient labels from the learned representations. (6) A late fusion combines predicted labels from all modalities and feeds them into a VCDN network to perform downstream tasks. b, The heterogeneous graph construction combines feature and patient similarity networks through feature-patient relations. c, Multiple stacked GAT layers (denoted by L) encode the heterogeneous graph into hidden representations for each node type. Each layer uses three GATs to learn the three relations within the graph, updating node representations by aggregating relation-specific information.
  • Figure 2: Performance comparison of HeteroGATomics with its feature selection module across five classifiers (mean and standard deviation over 10-fold cross-validation). The vertical bars show the mean, the black lines represent error bars indicating plus/minus one standard deviation, and each dot is a model's performance on each fold. HeteroGATomicsMAS + [classifier] denotes the results of the feature selection module within HeteroGATomics for a classifier, while HeteroGATomics represents the results derived from the entire HeteroGATomics architecture.
  • Figure 3: Comparison of HeteroGATomics performance with and without heterogeneous graphs on the LGG dataset (mean and standard deviation over 10-fold cross-validation). The vertical bars show the mean, the black lines represent error bars indicating plus/minus one standard deviation, and each dot is a model's performance on each fold. Homogeneous refers to HeteroGATomics without the feature similarity network, HeteroFeature removes edge attributes (correlation, edge desirability), HeteroEdge excludes node attributes (relevance, node desirability), and HeteroFeature+Edge represents the full HeteroGATomics setup.
  • Figure 4: Performance comparison of HeteroGATomics across different combinations of modalities on the LGG dataset (mean and standard deviation over 10-fold cross-validation). The vertical bars show the mean, the black lines represent error bars indicating plus/minus one standard deviation, and each dot is a model's performance on each fold. DNA, mRNA, and miRNA refer to the single-modality classification performance on DNA methylation, gene expression RNAseq, and miRNA mature strand expression RNAseq, respectively. Two-modality combinations refer to DNA+mRNA, DNA+miRNA, and mRNA+miRNA, while DNA+mRNA+miRNA refers to the classification performance across three modalities. In each case, 300 features are selected and divided equally among the modalities.
  • Figure 5: Known partners of selected top biomarkers. a, Results for the BLCA dataset. b, Results for the LGG dataset. Direct protein-protein interactions are recovered for DNA and mRNA omics. For the miRNA omics, known mRNA targets are recovered from starBase [li2014starbase]. The different omic categories from which the biomarkers originate are indicated as blue (DNA), green (mRNA) and orange (miRNA). Known cancer-related genes from the Cancer Gene Census database, OncoKB™ Cancer Gene List, and the Network Cancer Genome are circled in red [sondka2024cosmic, chakravarty2017oncokb, repana2019ncg].
  • ...and 8 more figures