Towards Benchmark Datasets for Machine Learning Based Website Phishing Detection: An experimental study
Abdelhakim Hannousse, Salima Yahiouche
TL;DR
This work tackles the lack of standardized benchmarks for ML-based website phishing detection by proposing a general scheme to construct reproducible and extensible datasets and by building a representative dataset with 87 features spanning URL-based, content-based, and external-based classes. Through a series of experiments on this dataset, Random Forest consistently emerges as the most predictive classifier, with external-service features providing the strongest discriminative power and hybrid feature sets achieving top accuracies around 96–97%. Feature selection via filter methods (notably chi-square with incremental removal) improves performance beyond wrappers, while combining models trained on different feature classes does not beat a single, well-tuned hybrid model. The study also demonstrates practical applicability by releasing dataset/code and illustrating a Chrome plugin, highlighting the approach’s potential to standardize comparisons and track phishing tactics over time.
Abstract
In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems that use different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined classification of website phishing features. We then systematically select a total of 87 commonly recognized features, classify them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset following the proposed scheme. Thereafter, we use a conceptual replication approach to check whether earlier findings generalize to the built dataset. Specifically, we evaluate the performance of classifiers on individual feature classes and on combinations of classes, we investigate different combinations of models, and we explore the effects of filter and wrapper methods on the selection of discriminative features. The results show that Random Forest is the most predictive classifier. Features gathered from external services are found to be the most discriminative, whereas features extracted from web page contents are found to be less distinguishing. In addition to external service based features, some web page content features are found to be time consuming and not suitable for runtime detection. The use of hybrid features provided the best accuracy score of 96.61%. Among the feature selection methods investigated, filter-based ranking together with incremental removal of less important features improved the accuracy up to 96.83%, outperforming wrapper methods.
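The filter-based selection summarized above (rank features, then incrementally remove the least important ones) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi2_score`, `rank_features`, and `incremental_removal` are hypothetical, the chi-square statistic is the textbook contingency-table form for categorical features, and the classifier is abstracted as a caller-supplied `evaluate` function (for instance, cross-validated Random Forest accuracy on the chosen columns).

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def chi2_score(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic between one categorical feature and the label."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # observed counts per (value, class)
    value_totals = Counter(feature)
    class_totals = Counter(labels)
    score = 0.0
    for value, n_value in value_totals.items():
        for cls, n_cls in class_totals.items():
            expected = n_value * n_cls / n  # count expected under independence
            observed = joint.get((value, cls), 0)
            score += (observed - expected) ** 2 / expected
    return score


def rank_features(X: Sequence[Sequence[int]], y: Sequence[int]) -> List[int]:
    """Return column indices ordered from most to least relevant."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def incremental_removal(
    X: Sequence[Sequence[int]],
    y: Sequence[int],
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Drop the lowest-ranked feature one at a time and keep the best subset.

    `evaluate` is an assumed callback that maps a list of column indices to a
    score such as accuracy; ties are resolved in favor of the smaller subset.
    """
    ranked = rank_features(X, y)
    best_subset, best_score = list(ranked), evaluate(list(ranked))
    for k in range(len(ranked) - 1, 0, -1):
        subset = ranked[:k]                 # keep the k top-ranked features
        score = evaluate(subset)
        if score >= best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

In practice, `evaluate` would train and validate a classifier restricted to the given columns. Passing it as a callback keeps the ranking logic independent of any particular model or library.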
