Screening of BindingDB database ligands against EGFR, HER2, Estrogen, Progesterone and NF-kB receptors based on machine learning and molecular docking

Parham Rezaee, Shahab Rezaee, Malik Maaza, Seyed Shahriar Arab

TL;DR

This study addresses the need for faster identification of breast cancer inhibitors by combining machine learning with structure-based docking to screen BindingDB ligands against EGFR/HER2, ER, NF-kB, and PR targets. Genetic algorithm (GA) feature selection is paired with SVM/RF classifiers to build a binary active/inactive predictor and a multiclass target predictor; the best pipeline, GA-SVM-SVM:GA-SVM-SVM, achieves 0.74 accuracy and 0.94 AUC and yields thousands of hits predicted with over 90% precision. Docking with AutoDock Vina gives binding energies between -15 and -5 kcal/mol, and subsequent medicinal-chemistry filters (Lipinski, Pfizer, GSK, golden triangle, QED, SAscore, MCE-18) narrow the hits to a small set of prioritised ligands. An interpretable SAR framework based on permutation importance and a simple dendrogram enables rapid target assignment for new hits, supporting downstream MD, in vitro, and in vivo studies.
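As a rough illustration of the first medicinal-chemistry filter named above, the sketch below checks Lipinski's rule of five on precomputed descriptors. The `Descriptors` class, the example values, and the `max_violations` default are assumptions made for illustration; the paper's actual filter settings are not given here, and in practice the descriptors would come from a cheminformatics toolkit rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: tolerate at most `max_violations` exceeded limits."""
    return lipinski_violations(d) <= max_violations


# Invented example values, not ligands from the study.
drug_like = Descriptors(mol_weight=350.4, logp=3.2, h_donors=2, h_acceptors=5)
borderline = Descriptors(mol_weight=520.0, logp=4.1, h_donors=3, h_acceptors=8)
poor = Descriptors(mol_weight=720.0, logp=6.5, h_donors=6, h_acceptors=12)

results = [passes_lipinski(d) for d in (drug_like, borderline, poor)]
```

Setting `max_violations=0` gives a stricter filter; the other rules listed (Pfizer, GSK, golden triangle) could be written as similar threshold checks on their own descriptors.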

Abstract

Breast cancer, the second most prevalent cancer among women worldwide, necessitates the exploration of novel therapeutic approaches. To target the four subgroups of breast cancer (hormone receptor-positive and HER2-negative, hormone receptor-positive and HER2-positive, hormone receptor-negative and HER2-positive, and hormone receptor-negative and HER2-negative), it is crucial to inhibit specific targets such as EGFR, HER2, ER, NF-kB, and PR. In this study, we evaluated various methods for binary and multiclass classification. Among them, the GA-SVM-SVM:GA-SVM-SVM model, with an accuracy of 0.74, an F1-score of 0.73, and an AUC of 0.94, was selected for virtual screening of ligands from the BindingDB database. This model identified 4454, 803, 438, and 378 ligands with over 90% precision in both active/inactive and target prediction for the EGFR+HER2, ER, NF-kB, and PR classes, respectively. Based on the selected ligands, we created a dendrogram that categorizes ligands by their targets. This dendrogram aims to facilitate the exploration of chemical space for various therapeutic targets. Ligands whose product of activity probability and correct target selection probability exceeded a 90% threshold were chosen for further investigation using molecular docking. The binding energies of these ligands against their respective targets were calculated to lie between -15 and -5 kcal/mol. Finally, based on general and common rules in medicinal chemistry, we selected 2, 3, 3, and 8 new ligands with high priority for further studies in the EGFR+HER2, ER, NF-kB, and PR classes, respectively.
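The selection rule described in the abstract, keeping ligands whose product of activity probability and target probability exceeds 90%, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the ligand identifiers, and the probability values are invented, and it assumes that the "correct target selection probability" is the multiclass model's probability for its top-ranked class.

```python
from typing import Dict, List, Tuple


def select_hits(
    p_active: Dict[str, float],
    p_target: Dict[str, Dict[str, float]],
    threshold: float = 0.90,
) -> List[Tuple[str, str, float]]:
    """Return (ligand, predicted target, joint score) for ligands whose
    joint score P(active) * P(top target) exceeds the threshold."""
    hits = []
    for ligand_id, pa in p_active.items():
        probs = p_target[ligand_id]
        best_target = max(probs, key=probs.get)  # top-ranked target class
        score = pa * probs[best_target]
        if score > threshold:
            hits.append((ligand_id, best_target, score))
    # Highest joint score first
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Invented probabilities standing in for the two classifiers' outputs.
p_active = {"lig_a": 0.98, "lig_b": 0.95, "lig_c": 0.99}
p_target = {
    "lig_a": {"EGFR+HER2": 0.97, "ER": 0.01, "NF-kB": 0.01, "PR": 0.01},
    "lig_b": {"EGFR+HER2": 0.03, "ER": 0.93, "NF-kB": 0.02, "PR": 0.02},
    "lig_c": {"EGFR+HER2": 0.20, "ER": 0.10, "NF-kB": 0.10, "PR": 0.60},
}

hits = select_hits(p_active, p_target)
```

Because the two probabilities are multiplied, a ligand must score highly on both stages: `lig_b` is confidently active and confidently assigned to ER, yet its joint score of about 0.88 falls below the cutoff.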

Paper Structure (5 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: The flowchart of these procedures.
  • Figure 2: The ROC plots for different classes with one-vs-rest strategy.
  • Figure 3: Left: feature importance for target prediction, computed with the permutation importance method on the GA-SVM-SVM model. Right: hierarchical clustering dendrogram of the features, built from Pearson correlation distances between them.
  • Figure 4: A simple questionnaire dendrogram that separates ligands using a small number of features and determines their targets.
  • Figure 5: Docking results of new ligands obtained from virtual screening. The pale dots in the following plots represent the active molecules in the BindingDB database for each class, the filled dots represent the molecules that participated in the construction of the model, and the red dots are the new molecules proposed by the model obtained from the screening.
  • ...and 1 more figures
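
The Pearson correlation distance mentioned in the Figure 3 caption is commonly defined as one minus the Pearson correlation coefficient between two feature columns. The sketch below computes it in plain Python on invented feature values; the feature names are hypothetical, and the linkage method used to build the dendrogram is not stated here, so only the distance matrix is shown.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x: List[float], y: List[float]) -> float:
    """0 for perfectly correlated columns, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)


# Invented feature columns (one value per ligand), for illustration only.
features: Dict[str, List[float]] = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0],   # scales with feat_a
    "feat_c": [4.0, 3.0, 2.0, 1.0],   # reversed feat_a
}

names = list(features)
distance_matrix = [
    [correlation_distance(features[a], features[b]) for b in names]
    for a in names
]
```

Features at a small distance carry largely redundant information, which is what lets a dendrogram group them. With SciPy available, `scipy.spatial.distance.pdist` with `metric="correlation"` computes the same distance and its output can be passed to `scipy.cluster.hierarchy.linkage`.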