Identifying genes associated with phenotypes using machine and deep learning

Muhammad Muneeb, David B. Ascher, YooChan Myung

TL;DR

The results suggest that SNPs selected by the ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, which may support downstream studies of disease mechanisms and candidate therapeutic targets.

Abstract

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
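The abstract names three metrics for ranking case/control classifiers: AUC, F1 score, and MCC. The sketch below shows one way to compute them for binary labels using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0/1 label encoding (1 = case, 0 = control), and the 0.5 decision threshold in the demo are all hypothetical choices.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, assuming 1 = case and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when any margin is empty."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win. This pairwise form is quadratic in the number of
    samples, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels and model scores, for illustration only.
    labels = [1, 1, 1, 0, 0, 0]
    model_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in model_scores]
    print(auc(labels, model_scores), f1_score(labels, predictions), mcc(labels, predictions))
```

AUC is computed from the raw scores, while F1 and MCC depend on thresholded predictions, so the same model can rank differently under each metric. That is consistent with the abstract selecting a best-performing model per metric.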

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Workflow for identifying genes associated with phenotypes. First, the phenotype data are cleaned, the genotype data are converted to PLINK format, and quality-control steps are applied to the genotype data. Gene associations are then downloaded for each phenotype. The data are split into five folds, and bed, bim, and fam files are generated for each split. The genotype data are divided into training and test sets, sub-datasets are generated using p-value thresholds, the data are passed to machine/deep learning models, and AUC, MCC, and F1 score are reported for each model and p-value threshold. Finally, feature (SNP) importance is listed for the best-performing models, the identified SNPs are compared with those downloaded from the GWAS Catalog, and the gene identification ratio is reported for each phenotype.
  • Figure 2: Process for listing the features identified by machine and deep learning models. We extracted the top-ranked SNPs from the best-performing machine and deep learning models in terms of AUC, F1 score, and MCC. Those SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog.
  • Figure 3: Heatmap of AUC, F1 score, and MCC for machine and deep learning. The heatmap shows the classification results of the best-performing machine learning (ML) and deep learning (DL) models in terms of AUC, F1 score, and MCC.
  • Figure 4: Heatmap of genes identified by the best-performing ML/DL models according to AUC, F1 score, and MCC. The first column shows the number of SNPs or genes overlapping between the genotype dataset and the GWAS Catalog. ML and DL denote machine learning and deep learning, respectively. Columns 2–7 show the number of SNPs identified by ML and DL models that achieved the best performance for AUC, F1 score, and MCC.
  • Figure 5: A heatmap showing the number of common SNPs between phenotypes.
  • ...and 1 more figure
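The captions for Figures 1, 2, and 4 describe three steps that can be sketched in a few lines: building sub-datasets from p-value thresholds, taking the top-ranked SNPs from a model's feature importances, and comparing them with GWAS Catalog entries. The code below is a minimal sketch, not the paper's method. The function names, the SNP identifiers, the inclusive `<=` threshold, and ranking by absolute importance are hypothetical. In particular, the ratio shown is an assumed definition (identified entries divided by catalog entries that also occur in the genotype dataset); the paper's own equation for the gene identification ratio is not reproduced in this summary and may differ.

```python
def subset_by_pvalue(snp_pvalues, threshold):
    """Keep SNP IDs whose association p-value is at or below the threshold.

    Whether the boundary is inclusive is an assumption made for this sketch.
    """
    return {snp for snp, p in snp_pvalues.items() if p <= threshold}


def top_ranked(importances, k):
    """Return the k SNP IDs with the largest absolute importance, highest first."""
    ranked = sorted(importances, key=lambda snp: abs(importances[snp]), reverse=True)
    return ranked[:k]


def identification_ratio(identified, catalog, dataset):
    """ASSUMED definition, not taken from the paper.

    Fraction of catalog entries present in the dataset (the recoverable ones)
    that also appear among the identified entries.
    """
    recoverable = set(catalog) & set(dataset)
    if not recoverable:
        return 0.0
    return len(set(identified) & recoverable) / len(recoverable)


if __name__ == "__main__":
    # Hypothetical SNP IDs, p-values, and importances, for illustration only.
    pvalues = {"rs1": 1e-6, "rs2": 0.03, "rs3": 0.5, "rs4": 1e-4}
    kept = subset_by_pvalue(pvalues, 0.05)
    importances = {"rs1": 0.7, "rs2": -0.9, "rs4": 0.1}
    best = top_ranked(importances, 2)
    catalog = {"rs1", "rs4", "rs9"}
    print(sorted(kept), best, identification_ratio(best, catalog, pvalues))
```

Restricting the denominator to catalog entries that occur in the genotype dataset mirrors the first column of Figure 4, which counts the SNPs or genes shared between the dataset and the GWAS Catalog. A catalog entry absent from the dataset could never be recovered by any model.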