MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection
Paulo Mendes, Eva Maia, Isabel Praça
TL;DR
Phishing remains a critical cybersecurity threat, and the effectiveness of ML-based detection hinges on the quality and diversity of training data. The authors construct MeAJOR Corpus by merging five open-source phishing datasets, applying consistent preprocessing and extensive feature engineering across body text, URLs, attachments, headers, HTML, and domain reputation. They then evaluate four ML/DL models across four feature sets. XGBoost achieves up to 98.34% F1 on the Text+URL configuration; URL features drive substantial gains, while the Text-only baseline remains strong, which the authors take as evidence of the dataset's robustness and versatility. MeAJOR advances reproducibility and generalisability in phishing detection and enables multi-modal and future large-language-model approaches in practical defense pipelines.
Abstract
Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance depends largely on the quality and diversity of the training data. This paper presents MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges such as class imbalance, generalisability, and reproducibility.
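For readers unfamiliar with the headline metric, the following is a minimal sketch of how a binary F1 score can be computed from predictions. It assumes phishing is the positive class (label 1) and uses plain binary F1; the abstract does not state which averaging scheme the authors used, and the function name `f1_score` and the toy labels are illustrative only, not taken from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class.

    Assumption: label 1 marks a phishing email, label 0 a legitimate one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not drawn from the corpus).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because F1 combines precision and recall, it is less sensitive to class imbalance than plain accuracy, which is one reason it is commonly reported for phishing detection.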
