A Multimodal Graph Neural Network Framework of Cancer Molecular Subtype Classification

Bingjun Li, Sheida Nabavi

TL;DR

A novel end-to-end multi-omics GNN framework for accurate and robust cancer subtype classification that utilizes multi-omics data in the form of heterogeneous multi-layer graphs, which combine both inter-omics and intra-omic connections from established biological knowledge.

Abstract

The recent development of high-throughput sequencing has created a large collection of multi-omics data, which enables researchers to better investigate cancer molecular profiles and cancer taxonomy based on molecular subtypes. Integrating multi-omics data has been proven to be effective for building more precise classification models. Current multi-omics integrative models mainly use early fusion by concatenation or late fusion based on deep neural networks. Due to the nature of biological systems, graphs are a better representation of biomedical data. Although a few graph neural network (GNN) based multi-omics integrative methods have been proposed, they suffer from three common disadvantages. First, most of them use only one type of connection, either inter-omics or intra-omic; second, they consider only one kind of GNN layer, either graph convolutional network (GCN) or graph attention network (GAT); and third, most of these methods have not been tested on a more complex cancer classification task. We propose a novel end-to-end multi-omics GNN framework for accurate and robust cancer subtype classification. The proposed model utilizes multi-omics data in the form of heterogeneous multi-layer graphs that combine both inter-omics and intra-omic connections from established biological knowledge. The proposed model incorporates learned graph features and global genome features for accurate classification. We test the proposed model on the TCGA Pan-cancer dataset and the TCGA breast cancer dataset for molecular subtype and cancer subtype classification, respectively. The proposed model outperforms four current state-of-the-art baseline models on multiple evaluation metrics. The comparative analysis of GAT-based models and GCN-based models reveals that GAT-based models are preferred for smaller graphs with less information and GCN-based models are preferred for larger graphs with extra information.

Paper Structure (33 sections, 12 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: The overall structure of the proposed model, which has four major modules shown as dotted grey rectangles. The input graph, shown on the leftmost side, consists of inter-omics edges (red), intra-omic edges (blue), and miRNA-miRNA meta-path edges (black dashed), and carries three types of omics data: mRNA (orange boxes), CNV (yellow boxes), and miRNA (green boxes). Module 1 consists of two parallel linear dimension-increase layers, one for gene-based nodes and one for miRNA-based nodes. The updated graph shown in the middle is obtained by feeding the node attributes from the input graph through module 1, where the dark orange boxes are the updated gene-based node attributes and the dark green boxes are the updated miRNA-based node attributes. Module 2 consists of two graph neural network layers, which can be of any GNN type. The output of module 2 is fed into a max pooling layer and then a transformation layer to obtain the learned graph representation (blue boxes). Module 3 consists of a decoder that reconstructs the input graph node attributes from the graph representation. Module 4 consists of a shallow fully connected network that takes the updated node attributes as the input. The output of this parallel network (grey cubes) is then concatenated with the learned graph representation and passed through a classification layer for the classification task.
  • Figure 2: The overall graph, called the supra-graph, is constructed from three different types of omics data on the left-hand side and two prior knowledge graphs on the right-hand side. mRNA (orange table) and CNV (yellow table) data are considered gene-based and have the same dimension. miRNA (green table) data has the same number of rows but different feature lengths for each sample.
  • Figure 3: The number of cases in each molecular subtype is shown. All samples from class 24 are excluded due to lack of miRNA data.
  • Figure 4: Performance of the Proposed Models and Baseline Models with Different Numbers of Genes on BRCA Dataset
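
The four modules described in the Figure 1 caption can be sketched as a single-sample forward pass. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the toy sizes, the random weights, the edge list, and every name (`forward`, `params`, `normalize_adjacency`) are hypothetical; GCN-style layers stand in for "any graph neural network"; max pooling is assumed to run over each node's feature vector; and the decoder and classifier are single linear layers. Training and the reconstruction loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all hypothetical: 4 gene nodes, 2 miRNA nodes.
N_GENES, N_MIRNA = 4, 2
N = N_GENES + N_MIRNA
D, R, H, N_CLASSES = 8, 16, 16, 5  # node dim, graph-repr dim, hidden dim, classes


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()


def normalize_adjacency(adj):
    """Symmetric GCN normalization: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def init(*shape):
    return rng.normal(scale=0.1, size=shape)


params = {
    "w_gene": init(2, D),        # module 1: gene nodes carry (mRNA, CNV)
    "w_mirna": init(1, D),       # module 1: miRNA nodes carry one value
    "w_gnn1": init(D, D),        # module 2: first GNN layer
    "w_gnn2": init(D, D),        # module 2: second GNN layer
    "w_transform": init(N, R),   # transformation layer after pooling
    "w_decoder": init(R, 2 * N_GENES + N_MIRNA),  # module 3
    "w_parallel": init(N * D, H),                 # module 4
    "w_classify": init(R + H, N_CLASSES),
}


def forward(gene_x, mirna_x, adj, p):
    # Module 1: two parallel linear dimension-increase layers.
    h0 = np.vstack([gene_x @ p["w_gene"], mirna_x @ p["w_mirna"]])  # (N, D)

    # Module 2: two GNN layers (GCN-style here), then pooling + transformation.
    a_norm = normalize_adjacency(adj)
    h1 = relu(a_norm @ h0 @ p["w_gnn1"])
    h2 = relu(a_norm @ h1 @ p["w_gnn2"])
    pooled = h2.max(axis=1)                       # assumed: max over features, (N,)
    graph_repr = relu(pooled @ p["w_transform"])  # (R,)

    # Module 3: decoder reconstructs the flattened input node attributes.
    recon = graph_repr @ p["w_decoder"]           # (2 * N_GENES + N_MIRNA,)

    # Module 4: shallow fully connected network on the updated node attributes.
    parallel = relu(h0.reshape(-1) @ p["w_parallel"])  # (H,)

    # Concatenate both branches and classify.
    logits = np.concatenate([graph_repr, parallel]) @ p["w_classify"]
    return softmax(logits), recon


# Toy supra-graph: nodes 0-3 are genes, nodes 4-5 are miRNAs.
edges = [
    (0, 1), (1, 2), (2, 3),  # intra-omic gene-gene edges
    (4, 0), (4, 1), (5, 2),  # inter-omics miRNA-gene edges
    (4, 5),                  # miRNA-miRNA meta-path edge
]
adj = np.zeros((N, N))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

gene_x = rng.normal(size=(N_GENES, 2))   # one sample: (mRNA, CNV) per gene
mirna_x = rng.normal(size=(N_MIRNA, 1))  # one sample: expression per miRNA

probs, recon = forward(gene_x, mirna_x, adj, params)
print(probs.shape, recon.shape)
```

Swapping the two GCN-style lines in module 2 for an attention-based update would give the GAT variant that the abstract compares against; the rest of the pipeline would stay the same in this sketch.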