Multilingual Email Phishing Attacks Detection using OSINT and Machine Learning
Panharith An, Rana Shafi, Tionge Mughogho, Onyango Allan Onyango
TL;DR
This work tackles multilingual email phishing detection by integrating OSINT-derived features from Nmap and theHarvester with ML classifiers across English and Arabic datasets. It constructs OSINT-augmented English and Arabic data, trains five classifiers (DT, RF, SVM, XGBoost, Multinomial NB) with grid-optimized hyperparameters, and evaluates performance using accuracy and other metrics. Random Forest emerges as the top performer, reaching up to 97.37% accuracy on both languages, demonstrating that OSINT features can boost detection across languages. The study provides a reproducible OSINT-assisted ML pipeline for multilingual phishing detection, discusses model-dependent effects (e.g., NB underperforms with correlated features), and outlines future directions including larger datasets and transformer-based multilingual models for improved generalization.
Abstract
Email phishing remains a prevalent cyber threat, targeting victims to extract sensitive information or deploy malicious software. This paper explores the integration of open-source intelligence (OSINT) tools and machine learning (ML) models to enhance phishing detection across multilingual datasets. Using Nmap and theHarvester, this study extracted 17 features, including domain names, IP addresses, and open ports, to improve detection accuracy. Multilingual email datasets, including English and Arabic, were analyzed to address the limitations of ML models trained predominantly on English data. Experiments with five classification algorithms (Decision Tree, Random Forest, Support Vector Machine, XGBoost, and Multinomial Naïve Bayes) revealed that Random Forest achieved the highest performance, with an accuracy of 97.37% on both the English and Arabic datasets. On the OSINT-enhanced datasets, the model was more accurate than baseline models trained without OSINT features. These findings highlight the potential of combining OSINT tools with advanced ML models to detect phishing emails more effectively across diverse languages and contexts. This study contributes an approach to phishing detection that incorporates OSINT features and evaluates their impact on multilingual datasets, addressing a critical gap in cybersecurity research.
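To make the idea of OSINT-derived features concrete, the following is a minimal Python sketch of how lookup results for a sender domain might be turned into a numeric feature row. It is an illustration under stated assumptions, not the pipeline used in this study: the `OsintRecord` container, the `sender_domain` and `osint_features` helpers, and the six feature names are hypothetical, and they do not reproduce the 17 features extracted here. The record is assumed to have been filled in beforehand from tool output such as that of Nmap and theHarvester.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Ports commonly associated with SMTP (an illustrative choice, not from the paper).
COMMON_MAIL_PORTS = {25, 465, 587}


@dataclass
class OsintRecord:
    """Hypothetical container for OSINT results about one sender domain."""
    domain: str
    ip_addresses: list[str] = field(default_factory=list)
    open_ports: list[int] = field(default_factory=list)
    harvested_emails: list[str] = field(default_factory=list)


def sender_domain(from_header: str) -> str:
    """Return the lowercased domain of the address in a From header, or ""."""
    match = re.search(r"@([A-Za-z0-9.-]+)", from_header)
    return match.group(1).lower() if match else ""


def osint_features(record: OsintRecord) -> dict[str, float]:
    """Map one OSINT record to a flat dictionary of numeric features."""
    ports = set(record.open_ports)  # drop duplicate port entries
    return {
        "domain_length": float(len(record.domain)),
        # Labels beyond the registered name and TLD, e.g. "mail" in mail.example.com.
        "num_subdomain_labels": float(max(record.domain.count(".") - 1, 0)),
        "num_ips": float(len(record.ip_addresses)),
        "num_open_ports": float(len(ports)),
        "has_mail_port": float(bool(ports & COMMON_MAIL_PORTS)),
        "num_harvested_emails": float(len(record.harvested_emails)),
    }


if __name__ == "__main__":
    record = OsintRecord(
        domain=sender_domain("Alice <alice@Mail.Example.com>"),
        ip_addresses=["192.0.2.10"],
        open_ports=[25, 80],
    )
    print(osint_features(record))
```

A row of this kind could be concatenated with text-derived features of the email body before training a classifier, which is one plausible way to compare a model with OSINT features against a baseline without them.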
