A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025
Rina Mishra, Gaurav Varshney
TL;DR
Phishing remains a critical security threat, motivating precise detection using Brand Domain Identification (BDI) features. The study systematically evaluates tightly bound domain features (TBDF) across multiple classifiers, collecting a dataset of 9,228 sites (4,667 legitimate and 4,561 phishing) and testing feature combinations. It finds that a compact three-feature set yields 99.8% accuracy with Random Forest (RF) and that the full feature set reaches 99.8% with MLP or XGBoost, supporting real-time deployment. The work highlights the practical viability of BDI-based detection and outlines future work on real-time systems, cross-domain fraud detection, and explainable AI enhancements.
Abstract
Phishing websites continue to pose a significant security challenge, making the development of robust detection mechanisms essential. Brand Domain Identification (BDI) serves as a crucial step in many phishing detection approaches. This study systematically evaluates the effectiveness of features employed over the past decade for BDI, focusing on their weighted importance in phishing detection as of 2025. The primary objective is to determine whether the identified brand domain matches the claimed domain, using features that are popular in phishing detection. To validate feature importance and evaluate performance, we conducted two experiments on a dataset comprising 4,667 legitimate sites and 4,561 phishing sites. In Experiment 1, we used the Weka tool and four of its attribute ranking evaluators to identify optimized and important feature sets among five features: CN Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link in Domain (MCLD), and Cookie Domain. The results revealed that none of the features were redundant, and Random Forest emerged as the best classifier, achieving an accuracy of 99.7% with an average response time of 0.08 seconds. In Experiment 2, we trained five machine learning models (Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and XGBoost) to assess the performance of individual BDI features and their combinations. The results demonstrated an accuracy of 99.8% with combinations of only three features, with Random Forest as the best classifier: (i) MCLD, LD, and FAD; and (ii) MCLD, CN, and LD. This study underscores the importance of leveraging key domain features for efficient phishing detection and paves the way for the development of real-time, scalable detection systems.
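The core check described above, whether each identified brand domain matches the claimed domain, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature keys, the 1/0/-1 encoding, and the naive "last two labels" domain extraction are all assumptions made for illustration, and a practical system would likely consult the Public Suffix List instead.

```python
from urllib.parse import urlparse

# Hypothetical keys mirroring the five BDI features named in the abstract.
BDI_FEATURES = ("cn", "logo", "form_action", "most_common_link", "cookie")


def registered_domain(url_or_host: str) -> str:
    """Naive registrable-domain guess: the last two labels of the host.

    Assumption: good enough for a sketch; it is wrong for suffixes such
    as "co.uk", which a Public Suffix List lookup would handle.
    """
    if "://" in url_or_host:
        host = urlparse(url_or_host).hostname or ""
    else:
        host = url_or_host
    host = host.lower().strip(".")
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def match_vector(claimed_url: str, identified: dict) -> list:
    """Encode each BDI feature as 1 (matches the claimed domain),
    0 (mismatch), or -1 (feature unavailable). The encoding is assumed."""
    claimed = registered_domain(claimed_url)
    vector = []
    for name in BDI_FEATURES:
        value = identified.get(name)
        if not value:
            vector.append(-1)
        else:
            vector.append(1 if registered_domain(value) == claimed else 0)
    return vector


# Made-up example page: the form posts to a different domain than the
# one the page claims, and no cookie domain was observed.
example = match_vector(
    "https://login.example.com/signin",
    {
        "cn": "example.com",
        "logo": "https://cdn.example.com/logo.png",
        "form_action": "http://evil.test/post",
        "most_common_link": "www.example.com",
    },
)
print(example)
```

A vector of this kind, one entry per feature, is the sort of input that could then be passed to a classifier such as Random Forest; how the study itself encodes the features is not stated in the abstract.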
