A Comparative Analysis of the Ensemble Methods for Drug Design
Rifkat Davronov, Fatima Adilova
TL;DR
This paper tackles the QSAR prediction challenge by systematically evaluating ensemble approaches on regression datasets in drug design. It constructs 57 configurations (2 ensemble methods × 19 base regressors = 38 ensembles, plus the 19 base regressors evaluated alone) and compares them on four drug-design datasets using cross-validation. The findings show that ensemble methods generally improve predictive accuracy over single models: additive regression (AR) often delivers the best ensemble performance, Extra Trees Regressor excels among single models, and Support Vector Machines frequently emerge as the strongest base learner. Dimension reduction via feature selection does not improve results except on one dataset, and clustering analyses suggest that ensemble decisions align with base-model behavior, supporting broader adoption of ensembles in QSAR modeling. The work also provides public code, which aids reproducibility and helps practitioners apply these methods to drug-design problems.
Abstract
Quantitative structure-activity relationship (QSAR) modeling is a computational technique for identifying relationships between the structural properties of chemical compounds and their biological activity. QSAR modeling is necessary for drug discovery, but it has many limitations. Ensemble-based machine learning approaches have been used to overcome these limitations and generate reliable predictions. Ensemble learning creates a set of diverse models and combines them. In our comparative analysis, each ensemble algorithm was paired with each of the base algorithms, and the base algorithms were also investigated separately. In this configuration, 57 algorithms were developed and compared on 4 different datasets. Thus, we propose a complex ensemble technique that builds diversified models and integrates them. The individual models did not show impressive results on their own, but they were considered the most important predictors when combined. We also assessed whether ensembles always give better results than individual algorithms. The Python code used to obtain the experimental results in this article has been uploaded to GitHub (https://github.com/rifqat/Comparative-Analysis).
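To make the experimental design concrete, the following minimal sketch shows how pairing every ensemble with every base algorithm, plus the base algorithms alone, yields 57 configurations, and how an additive-regression ensemble can wrap an arbitrary base regressor. It is an illustration under stated assumptions, not the code from the repository: additive regression is assumed here to mean stage-wise fitting of the base learner to the residuals of the previous stages with a shrinkage factor, and the names `ENSEMBLES`, `BASE_REGRESSORS`, `Stump`, and `AdditiveRegressor`, as well as the toy one-feature stump, are hypothetical placeholders.

```python
from itertools import product
from statistics import mean

# Placeholder names: only the counts (2 ensembles, 19 base regressors)
# come from the text; the identifiers themselves are hypothetical.
ENSEMBLES = ["additive_regression", "ensemble_2"]
BASE_REGRESSORS = [f"base_regressor_{i}" for i in range(1, 20)]


def enumerate_configurations(ensembles, bases):
    """Every (ensemble, base) pair, plus each base regressor on its own."""
    paired = list(product(ensembles, bases))
    standalone = [(None, base) for base in bases]
    return paired + standalone


class Stump:
    """Toy base regressor: a single threshold split on one numeric feature."""

    def fit(self, xs, ys):
        best = None
        for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
            left = [y for x, y in zip(xs, ys) if x <= threshold]
            right = [y for x, y in zip(xs, ys) if x > threshold]
            left_mean, right_mean = mean(left), mean(right)
            error = (sum((y - left_mean) ** 2 for y in left)
                     + sum((y - right_mean) ** 2 for y in right))
            if best is None or error < best[0]:
                best = (error, threshold, left_mean, right_mean)
        if best is None:  # all x values identical: fall back to a constant
            best = (0.0, xs[0], mean(ys), mean(ys))
        _, self.threshold, self.left_mean, self.right_mean = best
        return self

    def predict(self, xs):
        return [self.left_mean if x <= self.threshold else self.right_mean
                for x in xs]


class AdditiveRegressor:
    """Assumed form of additive regression: each stage fits the residuals
    left by the previous stages; predictions are the shrunken sum."""

    def __init__(self, make_base, n_stages=10, shrinkage=0.5):
        self.make_base = make_base
        self.n_stages = n_stages
        self.shrinkage = shrinkage
        self.models = []

    def fit(self, xs, ys):
        self.models = []
        residuals = list(ys)
        for _ in range(self.n_stages):
            model = self.make_base().fit(xs, residuals)
            stage = model.predict(xs)
            residuals = [r - self.shrinkage * p
                         for r, p in zip(residuals, stage)]
            self.models.append(model)
        return self

    def predict(self, xs):
        totals = [0.0] * len(xs)
        for model in self.models:
            totals = [t + self.shrinkage * p
                      for t, p in zip(totals, model.predict(xs))]
        return totals


def sse(y_true, y_pred):
    """Sum of squared errors between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))


configs = enumerate_configurations(ENSEMBLES, BASE_REGRESSORS)
print(len(configs))  # 2 * 19 + 19 = 57
```

In a real comparison each configuration would be scored with cross-validation on held-out folds rather than on the training data; the stump only stands in for any of the 19 base regressors so that the sketch stays dependency-free.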
