A Bioinformatic Approach Validated Utilizing Machine Learning Algorithms to Identify Relevant Biomarkers and Crucial Pathways in Gallbladder Cancer

Rabea Khatun, Wahia Tasnim, Maksuda Akter, Md Manowarul Islam, Md. Ashraf Uddin, Md. Zulfiker Mahmud, Saurav Chandra Das

TL;DR

NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.

Abstract

Gallbladder cancer (GBC) is the most common of the biliary tract neoplasms. Identifying the molecular mechanisms and biomarkers linked to GBC progression has been a significant challenge in scientific research. Few recent studies have explored the roles of biomarkers in GBC. Our study aimed to identify biomarkers in GBC using machine learning (ML) and bioinformatics techniques. We compared GBC tumor samples with normal samples to identify differentially expressed genes (DEGs) from two microarray datasets (GSE100363, GSE139682) obtained from the NCBI GEO database. A total of 146 DEGs were found, with 39 up-regulated and 107 down-regulated genes. Functional enrichment analysis of these DEGs was performed using Gene Ontology (GO) terms and REACTOME pathways through DAVID. The protein-protein interaction network was constructed using the STRING database. To identify hub genes, we applied three ranking algorithms: Degree, MNC, and Closeness Centrality. The intersection of hub genes from these algorithms yielded 11 hub genes. Simultaneously, two feature selection methods (Pearson correlation and recursive feature elimination) were used to identify significant gene subsets. We then developed ML models using SVM and RF on the GSE100363 dataset, with validation on GSE139682, to determine the gene subset that best distinguishes GBC samples. The hub genes outperformed the other gene subsets. Finally, NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.
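
The hub-gene step in the abstract (rank genes under several centrality measures, then keep only the genes that all rankings agree on) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `top_k` and `consensus_hubs`, the gene labels G1 to G5, and every score below are hypothetical placeholders, and real scores would come from a PPI network analysis rather than a hand-written dictionary.

```python
def top_k(scores, k):
    """Return the k genes with the highest score (ties broken by gene name)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {gene for gene, _ in ranked[:k]}


def consensus_hubs(scores_by_method, k):
    """Intersect the top-k gene sets produced by each ranking method."""
    top_sets = [top_k(scores, k) for scores in scores_by_method.values()]
    return set.intersection(*top_sets)


# Hypothetical toy scores, invented purely for illustration.
toy_scores = {
    "degree": {"G1": 9, "G2": 7, "G3": 6, "G4": 2, "G5": 1},
    "mnc": {"G1": 8, "G2": 3, "G3": 7, "G4": 6, "G5": 1},
    "closeness": {"G1": 0.9, "G2": 0.4, "G3": 0.8, "G4": 0.7, "G5": 0.2},
}

hubs = consensus_hubs(toy_scores, k=3)
print(sorted(hubs))  # ['G1', 'G3']
```

Taking the intersection rather than the union is the conservative choice: a gene is kept only if every measure places it in the top k, which matches the abstract's description of intersecting the three rankings.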

Paper Structure

This paper contains 22 sections, 1 equation, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Flow diagram of the proposed methodology. First, two microarray datasets (GSE100363, GSE139682) were downloaded from GEO. Second, differentially expressed genes (DEGs) were identified from those datasets. Next, Gene Ontology and pathway analyses were performed on the identified DEGs to screen for significant GO terms and pathways. The protein-protein interaction (PPI) network was then constructed. Subsequently, three ranking algorithms (Degree, MNC, Closeness Centrality) were employed to identify the top 15 hub genes, which, surprisingly, overlapped in 11 real hub genes. In parallel, two feature selection methods (Pearson correlation and recursive feature elimination) were employed to identify further significant gene subsets. Afterwards, the hub genes and the significant gene subsets were used to train machine learning models (SVM and RF) on the GSE100363 dataset. Finally, the models were tested on the independent GSE139682 dataset to validate the biomarkers. Additionally, the real hub genes were validated using the GEPIA database.
  • Figure 2: Distribution of tumor and normal samples across the selected datasets. The GSE139682 series comprises 20 samples: 10 gallbladder tumors and 10 normal samples. The GSE100363 series comprises 8 samples: 4 gallbladder tumors and 4 normal samples.
  • Figure 3: Venn diagrams of the DEGs shared by the two datasets: (a) the common DEGs, (b) the common upregulated genes, and (c) the common downregulated genes.
  • Figure 4: Volcano plots of differentially expressed genes. Blue data points indicate down-regulated genes and red data points indicate up-regulated genes. Thresholds of $\log FC > 1$ for overexpression and $\log FC < -1$ for underexpression (together, $|\log FC| > 1$) were applied to define differential expression. Black dots indicate genes that were not differentially expressed.
  • Figure 5: Functional enrichment analyses: GO terms and REACTOME pathways for upregulated DEGs in this study. BP: upregulated DEGs enriched in biological process. CC: upregulated DEGs enriched in cellular component. MF: upregulated DEGs enriched in molecular function. The size of the bubble indicates the enrichment score; colors indicate enrichment significance.
  • ...and 6 more figures
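
The fold-change thresholds in the Figure 4 caption and the overlap shown in Figure 3 can be sketched together as below. This is a simplified illustration under stated assumptions, not the authors' analysis: it uses only the logFC cutoffs of +1 and -1 quoted in the caption, applies no statistical significance filter (this section does not state one), and the names `classify` and `common_degs`, the gene labels, and all logFC values are hypothetical.

```python
def classify(log_fc, threshold=1.0):
    """Label a gene 'up', 'down', or 'ns' (not significant) by its logFC."""
    if log_fc > threshold:
        return "up"
    if log_fc < -threshold:
        return "down"
    return "ns"


def common_degs(dataset_a, dataset_b, threshold=1.0):
    """Collect genes regulated in the same direction in both datasets."""
    common = {"up": set(), "down": set()}
    for gene in dataset_a.keys() & dataset_b.keys():
        call_a = classify(dataset_a[gene], threshold)
        call_b = classify(dataset_b[gene], threshold)
        if call_a == call_b and call_a != "ns":
            common[call_a].add(gene)
    return common


# Hypothetical logFC values (tumor vs. normal), invented for illustration.
toy_a = {"G1": 2.3, "G2": -1.8, "G3": 0.4, "G4": 1.5}
toy_b = {"G1": 1.2, "G2": -2.5, "G3": -1.1, "G4": -1.3}

result = common_degs(toy_a, toy_b)
print(result)
```

In this toy input, G4 is excluded because its direction differs between the two datasets, and G3 is excluded because it passes the cutoff in only one of them.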