Predicting Drug Effects from High-Dimensional, Asymmetric Drug Datasets by Using Graph Neural Networks: A Comprehensive Analysis of Multitarget Drug Effect Prediction

Avishek Bose, Guojing Cong

TL;DR

This study tackles multitarget drug-effect prediction in high-dimensional, asymmetrically distributed datasets using graph neural networks. It introduces a hybrid GNN framework that fuses graph-based embeddings with fingerprint features and proposes a multilabel oversampling technique with complexity $O(n)$ and oversampling level $P=0.25$, addressing label imbalance and co-occurrence. The best-performing model—hybrid AttentiveFP with fingerprint input—outperforms baselines including MLSMOTE-oversampled GNNs on multilabel classification and improves multiregression metrics, demonstrating robustness across five datasets. Overall, the work advances practical multitarget drug-effect prediction by mitigating imbalance and asymmetric co-occurrence, with implications for more reliable in silico drug evaluation.

Abstract

Graph neural networks (GNNs) have emerged as one of the most effective ML techniques for drug effect prediction from drug molecular graphs. Despite their potential, GNN models underperform on datasets whose targets are high-dimensional, asymmetrically co-occurrent drug effects with complex correlations between them. Training an individual model for each drug effect and then combining the predictions across a wide spectrum of drug effects is impractical. We therefore treat the task as a multitarget prediction problem and predict all drug effects at once. We developed standard and hybrid GNNs to perform two separate tasks: multiregression for the continuous values and multilabel classification for the categorical values contained in our datasets. Because multilabel classification makes the target data even more sparse and introduces asymmetric label co-occurrence, these models become difficult to train and GNN performance degrades substantially. To address these challenges, we propose a new data oversampling technique to improve multilabel classification performance on all the given imbalanced molecular graph datasets. Using the technique, we improve the data imbalance ratio of the drug effects while protecting the datasets' integrity. Finally, we evaluate the multilabel classification performance of the best-performing hybrid GNN model on all the oversampled datasets obtained from the proposed oversampling technique. On all evaluation metrics (i.e., precision, recall, and F1 score), this model significantly outperforms other ML models, including GNN models trained on the original datasets or on datasets oversampled with MLSMOTE, a well-known oversampling technique.
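
The abstract refers to a per-label imbalance ratio and to asymmetric label co-occurrence without defining either. As a rough illustration only, the sketch below uses two conventions that are assumptions here, not the authors' definitions: the imbalance ratio per label (IRLbl) common in the multilabel literature, i.e. the count of the most frequent label divided by the count of the given label, and the conditional frequency P(j | i) as one way to expose asymmetry in co-occurrence. The function names and the toy label matrix are hypothetical.

```python
from typing import List

def label_counts(Y: List[List[int]]) -> List[int]:
    """Number of positive samples per label (rows = drugs, columns = labels)."""
    return [sum(col) for col in zip(*Y)]

def imbalance_ratios(Y: List[List[int]]) -> List[float]:
    """Assumed IRLbl definition: count of the most frequent label / count of this label."""
    counts = label_counts(Y)
    top = max(counts)
    return [top / c if c > 0 else float("inf") for c in counts]

def mean_ir(Y: List[List[int]]) -> float:
    """Mean of the finite per-label imbalance ratios."""
    finite = [r for r in imbalance_ratios(Y) if r != float("inf")]
    return sum(finite) / len(finite)

def cooccurrence(Y: List[List[int]]) -> List[List[int]]:
    """C[i][j] = number of samples that carry both label i and label j."""
    n_labels = len(Y[0])
    C = [[0] * n_labels for _ in range(n_labels)]
    for row in Y:
        active = [i for i, v in enumerate(row) if v]
        for i in active:
            for j in active:
                C[i][j] += 1
    return C

def conditional(C: List[List[int]], j: int, i: int) -> float:
    """P(label j | label i), estimated as C[i][j] / C[i][i]."""
    return C[i][j] / C[i][i]

# Hypothetical toy data: 4 drugs, 3 effect labels.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
C = cooccurrence(Y)
print(imbalance_ratios(Y))                          # per-label IRLbl
print(mean_ir(Y))                                   # mean imbalance ratio
print(conditional(C, 1, 0), conditional(C, 0, 1))   # P(1|0) vs. P(0|1)
```

In the toy matrix, label 0 appears in every sample while labels 1 and 2 appear once each, so the rare labels have a much higher imbalance ratio, and P(0 | 1) differs from P(1 | 0) even though the raw co-occurrence count is symmetric. An oversampling method would aim to lower the ratios of the rare labels without distorting such conditional frequencies too far.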

Paper Structure

This paper contains 10 sections, 5 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Generic network architecture of implemented hybrid GNN models for multiregression and multilabel classification.
  • Figure 2: Subfigures (a–d) display the percentages of data samples for each label in the datasets (data1, data2, data4, and data5).
  • Figure 3: Subfigures (a–f) display co-occurrence of labels for two oversampling methods in datasets data1 and data2.
  • Figure 4: Scatter plots (a–d) display the correlation between the target and predicted values for the given datasets.