GATher: Graph Attention Based Predictions of Gene-Disease Links

David Narganes-Carlon; Anniek Myatt; Mani Mudaliar; Daniel J. Crowther

TL;DR

GATher outperforms existing models like GAT, GATv2, and HGT in predicting clinical trial outcomes, demonstrating its potential in enhancing target validation and predicting clinical efficacy and safety.

Abstract

Target selection is crucial in pharmaceutical drug discovery, directly influencing clinical trial success. Despite its importance, drug development remains resource-intensive, often taking over a decade with significant financial costs. High failure rates highlight the need for better early-stage target selection. We present GATher, a graph attention network designed to predict therapeutic gene-disease links by integrating data from diverse biomedical sources into a graph with over 4.4 million edges. GATher incorporates GATv3, a novel graph attention convolution layer, and GATv3HeteroConv, which aggregates transformations for each edge type, enhancing its ability to manage complex interactions within this extensive dataset. Using hard negative sampling and multi-task pre-training, GATher addresses topological imbalances and improves specificity. Trained on data up to 2018 and evaluated through 2024, GATher predicts clinical trial outcomes with a ROC AUC of 0.69 for unmet efficacy failures and 0.79 for positive efficacy. Feature attribution with Captum highlights key nodes and relationships, enhancing model interpretability. By 2024, GATher improved precision in prioritizing the top 200 clinical trial targets to 14.1%, an absolute increase of over 3.5 percentage points compared to other methods. GATher outperforms existing models like GAT, GATv2, and HGT in predicting clinical trial outcomes, demonstrating its potential in enhancing target validation and predicting clinical efficacy and safety.
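As a rough illustration of the two metrics quoted in the abstract, the sketch below computes ROC AUC and precision among the top-k ranked predictions in plain Python. The function names, scores, and labels are invented for illustration and are not taken from the paper, and the toy data are not meant to reproduce the reported values.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring pairs that are true positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six target-disease pairs (1 = advanced in trials).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]

print(precision_at_k(scores, labels, k=2))
print(round(roc_auc(scores, labels), 3))
```

The pairwise form of ROC AUC is quadratic in the number of examples, which is fine for a toy example; a rank-based implementation would be preferable at the scale of a full target-disease ranking.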

Paper Structure (38 sections, 7 equations, 7 figures, 8 tables)

Introduction
Target Discovery
Biomedical Graphs
Previous Work
Graph Attention Networks
GATher, GATv3, and GATv3HeteroConv
Results
The Pipeline
Performance of GATv3
First-in-Class and Siren Targets
Performance of GATv3 HeteroConv
The Graph Features
Graph Explanations
Discussion
GATher and GATv3
...and 23 more sections

Figures (7)

Figure 1: Workflow: The pipeline begins with node and edge input data (A and D), moves through GATher's schematic (B and E), and concludes with model training (C and F). (A) displays connections among drugs (green), gene targets (blue), and diseases (red), with edges representing biological and chemical relationships. (B) presents the GATher model's schematic, highlighting the encoder-decoder architecture for classifying and regressing edges. (C) outlines the training process: training on 80% of the 2018 data, testing on 10%, and validating on another 10% from 2018, with additional validation on a prospective future dataset. The multi-stage training includes initial pre-training and fine-tuning for clinical trial regression. (D) details a subgraph illustrating the inhibition of JAK1 and JAK2 by tofacitinib for ulcerative colitis treatment. Entities are shown as coloured vectors with gradient colours: gene targets (blue), diseases (red), and drugs (green). (E) demonstrates edge formation by transforming node pairs using GATv3's attention mechanism, with each edge type having a dedicated layer. Nodes integrate features and graph information post-encoding. (F) describes multi-task training, where edges are decoded into classification probabilities and regression estimates. The final step is fine-tuning for clinical trial prediction of protein (target)-disease links for the four clinical trial outcomes.
Figure 2: Performance analysis of GATher by layer type and number. (A) Horizontal boxplots showing validation MSE for 1440 hyperparameter sets for each layer type using a single seed. Lower MSE values indicate better performance. (B) Horizontal boxplots showing test MSE for 1440 hyperparameter sets for each layer type using a single seed. (C) Horizontal boxplots showing validation MSE for 1 or 2 layers across 64 seeds, with scatter points for individual MSE values. (D) Horizontal boxplots showing test MSE for 1 or 2 layers across 64 seeds. (E) Heatmap of Mann-Whitney U test results on test MSE from 64 seeds, indicating differences in MSE distributions among layer types and numbers with log-transformed p-values and annotations for significant ones. (F) Heatmap showing Mann-Whitney U test results for test MSE across 1440 hyperparameter sets with one seed, highlighting significant p-values. The figure visualises GATher's performance segmented by layer type and depth, evaluating layers such as GATv3 (ours), GATv2, GAT, and HGT as implemented in PyTorch Geometric [pytorchGeom2019].
Figure 3: Evaluation of clinical phase progression predictions for first-in-class and siren targets. (A) Precision over time for predictions of target-disease pairs advancing into clinical trials, showing the GATher median precision and standard deviation across multiple diseases, with their individual MONDO terms annotated in grey with grey lines. (B) Scatter plot of predicted maximum trial phases with unmet and positive efficacy in 2018, grouped by clinical phase progress. Lines of best fit for each progress category are displayed alongside the 'y = x' line. First-in-class targets and siren targets are highlighted with specific symbols. (C) Density plot comparing predicted positive efficacy scores between targets advancing in clinical trials and those with no further trial progression. The decision threshold is indicated by the dashed vertical line. (D) Density plot comparing predicted unmet efficacy scores between targets with recent efficacy failures and those without. The decision threshold from Table \ref{tab:performance} is also indicated.
Figure 4: Performance and justification of the GATv3 HeteroConv model. A: Distribution of mean squared error (MSE) values on the test dataset across different attention layers: GAT, GATv2, GATv3, and HGT. This shows the variability and central tendency of MSE for each layer. B: Distribution of MSE values on the validation dataset for different attention layers, indicating consistency of performance across different runs. C: Bar chart of GATv3 HeteroConv Alpha values for different relations. Each bar represents a relation, with its length indicating the alpha value. Colours represent different categories of linked entities as shown in the legend. D: Comparison of GATv3 HeteroConv Alpha values between a shallow 1-layer model and a deep 2-layer model. Each point represents a relation, with its position based on the alpha values in the shallow model (x-axis) and the deep model (y-axis). The dashed line (y = x) shows where the alpha values would be equal for both models. Colours indicate different entity categories as in C. E: Mean PyTorch GAT attention against the number of edges for each relation. The y-axis is the number of edges (log scale), and the x-axis is the mean attention. Colours represent different categories of relations, showing the correlation between attention and edge count. F: GATv3 HeteroConv Alpha values against the number of edges for a shallow 1-layer model. The y-axis is the number of edges (log scale), and the x-axis is the alpha value. The figure shows the correlation between alpha values and edge count for various relations. The colormap for the relations is the same for subplots C, D, E, and F.
Figure 5: PCA Projections (A to F) and Evaluation of GATher Features (G to I). Engineered Features (A-C): (A) Gene Clustering: PCA of Human Protein Atlas single-cell data shows cell-specific markers in T lymphocytes (blue), melanocytes (green), and oligodendrocytes (red). Highlighted genes include TRA, TRB, MLANA, OLIG1, and OLIG2, with proteins ERBB2 and ERBB3. (B) Disease Clustering: PCA of GPT disease embeddings categorises diseases like carcinoma (blue), congenital disorders (green), and metabolic disorders (purple), marking NAFLD and NASH. (C) Drug Clustering: PCA of SMILES fingerprints identifies drug classes like ACE inhibitors, PDE inhibitors, and beta-blockers, with kinase inhibitors lapatinib and afatinib marked. Learned Features (D-F): (D) GATher Embedding Clustering: PCA shows clusters for genes (blue), diseases (red), functions (purple), pathways (orange), and drugs (green), highlighting processes like EGFR signaling and DNA repair. (E) GATher Layer 1 Clustering: PCA after the first GATv3 encoder. (F) GATher Layer 2 Clustering: PCA after the second GATv3 encoder. Performance Evaluations (G-I, RMSE): (G) Learned Features Alone: Boxplots show RMSE impact by model depth on positive efficacy trials (Score 0.5 preclinical to 4 FDA-approved). (H) Engineered Features Alone: Boxplots show RMSE variation with model depth. (I) Engineered and Learned Features: Boxplots indicate better performance with combined features, balancing manual and automatic extraction.
...and 2 more figures
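
The pairwise comparisons in Figure 2 (E and F) apply the Mann-Whitney U test to MSE distributions and report log-transformed p-values. A minimal sketch of that kind of comparison follows, assuming a two-sided test with the normal approximation and no tie or continuity correction. The function name and MSE values are hypothetical, and the paper's own analysis may well have used a library routine such as `scipy.stats.mannwhitneyu` with different settings.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    No tie or continuity correction is applied (simplifying assumption).
    """
    n1, n2 = len(x), len(y)
    # U1 counts the pairs in which a value from x exceeds a value from y.
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical test MSE values for two layer types across eight seeds each.
mse_a = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17]
mse_b = [0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27]

u, p = mann_whitney_u(mse_a, mse_b)
print(u, p, -math.log10(p))  # the last value is a log-transformed p-value
```

A rank-based test suits this setting because MSE distributions across seeds or hyperparameter sets need not be normal, and the test only asks whether one layer type tends to yield lower errors than another.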
