Screening of BindingDB database ligands against EGFR, HER2, Estrogen, Progesterone and NF-kB receptors based on machine learning and molecular docking
Parham Rezaee, Shahab Rezaee, Malik Maaza, Seyed Shahriar Arab
TL;DR
This study addresses the need for faster identification of breast cancer inhibitors by combining machine learning with structure-based docking to screen BindingDB ligands against EGFR/HER2, ER, NF-κB, and PR targets. Genetic-algorithm (GA) feature selection paired with SVM/RF classifiers is used to build a binary active/inactive predictor and a multiclass target predictor. The best pipeline, GA-SVM-SVM:GA-SVM-SVM, achieves 0.74 accuracy and 0.94 AUC and identifies 4454, 803, 438, and 378 ligands at over 90% precision for the EGFR+HER2, ER, NF-κB, and PR classes, respectively. Docking with AutoDock Vina gives binding energies between -15 and -5 kcal/mol for the selected ligands, and subsequent medicinal-chemistry filters (Lipinski, Pfizer, GSK, golden triangle, QED, SAscore, MCE-18) prune these to a small set of prioritised ligands. An interpretable SAR framework based on permutation importance and a simple dendrogram enables rapid target assignment for new hits, supporting downstream MD, in vitro, and in vivo studies.
Abstract
Breast cancer, the second most prevalent cancer among women worldwide, necessitates the exploration of novel therapeutic approaches. To target the four subgroups of breast cancer (hormone receptor-positive and HER2-negative, hormone receptor-positive and HER2-positive, hormone receptor-negative and HER2-positive, and hormone receptor-negative and HER2-negative), it is crucial to inhibit specific targets such as EGFR, HER2, ER, NF-kB, and PR. In this study, we evaluated various methods for binary and multiclass classification. Among them, the GA-SVM-SVM:GA-SVM-SVM model, with an accuracy of 0.74, an F1-score of 0.73, and an AUC of 0.94, was selected for virtual screening of ligands from the BindingDB database. This model identified 4454, 803, 438, and 378 ligands with over 90% precision in both active/inactive and target prediction for the EGFR+HER2, ER, NF-kB, and PR classes, respectively. Based on the selected ligands, we created a dendrogram that categorizes ligands by their targets and is intended to facilitate the exploration of chemical space for various therapeutic targets. Ligands for which the product of the activity probability and the correct-target probability exceeded a 90% threshold were chosen for further investigation using molecular docking. The calculated binding energies of these ligands against their respective targets ranged from -15 to -5 kcal/mol. Finally, based on commonly used medicinal-chemistry rules, we selected 2, 3, 3, and 8 new ligands with high priority for further studies in the EGFR+HER2, ER, NF-kB, and PR classes, respectively.
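The selection rule for docking candidates (the product of activity probability and target probability exceeding 90%) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Prediction` container, the `select_hits` function, the ligand identifiers, and all probability values are hypothetical, and the sketch assumes that the binary and multiclass classifiers each expose a probability per ligand.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# The four target classes named in the abstract.
TARGET_CLASSES = ("EGFR+HER2", "ER", "NF-kB", "PR")


@dataclass
class Prediction:
    """Hypothetical container for the two model outputs of one ligand."""
    ligand_id: str
    p_active: float             # probability from the binary active/inactive model
    p_target: Dict[str, float]  # per-class probabilities from the multiclass model


def select_hits(
    predictions: List[Prediction], threshold: float = 0.9
) -> Dict[str, List[Tuple[str, float]]]:
    """Group ligands by predicted target, keeping those whose combined
    score p_active * p_target exceeds the threshold."""
    hits: Dict[str, List[Tuple[str, float]]] = {name: [] for name in TARGET_CLASSES}
    for pred in predictions:
        # The most probable class is taken as the predicted target.
        target, p_t = max(pred.p_target.items(), key=lambda kv: kv[1])
        score = pred.p_active * p_t
        if score > threshold:
            hits[target].append((pred.ligand_id, score))
    # Rank each class so the highest combined score comes first.
    for name in hits:
        hits[name].sort(key=lambda item: item[1], reverse=True)
    return hits


# Illustrative values only; these are not results from the study.
example = [
    Prediction("lig-A", 0.98, {"EGFR+HER2": 0.97, "ER": 0.01, "NF-kB": 0.01, "PR": 0.01}),
    Prediction("lig-B", 0.95, {"EGFR+HER2": 0.03, "ER": 0.93, "NF-kB": 0.02, "PR": 0.02}),
]
selected = select_hits(example)
```

In this toy input, lig-B is rejected even though both of its probabilities exceed 0.9 individually, because their product does not. Thresholding the product is therefore stricter than thresholding each probability separately.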
