PhishSSL: Self-Supervised Contrastive Learning for Phishing Website Detection
Wenhao Li, Selvakumar Manickam, Yung-Wey Chong, Shankar Karuppayah, Priyadarsi Nanda, Binyong Li
TL;DR
This work tackles phishing website detection without relying on labeled data by introducing PhishSSL, a self-supervised contrastive learning framework that learns discriminative embeddings from unlabeled tabular website features. It combines hybrid tabular augmentation with an adaptive feature-attention encoder and a triplet-margin loss to produce robust representations, enabling prototype-based inference without supervision. Across three diverse datasets, PhishSSL consistently outperforms unsupervised and self-supervised baselines and demonstrates strong generalization through stable ROC AUC and F1 scores, as well as clear separation of phishing and legitimate samples in the embedding space. The approach offers a practical path toward robust, adaptable phishing defenses in dynamic Web environments, with potential extensions to multi-modal data fusion and adversarial robustness.
Abstract
Phishing websites remain a persistent cybersecurity threat, mimicking legitimate sites to steal sensitive user information. Existing machine learning-based detection methods often rely on supervised learning with labeled data, which incurs substantial annotation costs and limits adaptability to novel attack patterns. To address these challenges, we propose PhishSSL, a self-supervised contrastive learning framework that eliminates the need for labeled phishing data during training. PhishSSL combines hybrid tabular augmentation, which produces semantically consistent views, with adaptive feature attention, which emphasizes discriminative attributes. We evaluate PhishSSL on three phishing datasets with distinct feature compositions. Across all datasets, PhishSSL consistently outperforms unsupervised and self-supervised baselines, and ablation studies confirm the contribution of each component. Moreover, PhishSSL maintains robust performance despite the diversity of feature sets, highlighting its strong generalization and transferability. These results demonstrate that PhishSSL offers a promising solution for phishing website detection that is particularly effective against evolving threats in dynamic Web environments.
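To make the pipeline summarized above more concrete, the following is a minimal sketch of three of its ingredients: a tabular augmentation that creates a second view of a sample, a triplet-margin loss over embeddings, and nearest-prototype inference. It is an illustration under stated assumptions, not the authors' implementation. The function names, the specific augmentation (feature swapping plus Gaussian noise), the Euclidean distance, and all default parameters are hypothetical choices; the feature-attention encoder is omitted, and the prototypes are assumed to be given (for example, as cluster centroids in the embedding space).

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def augment(row, pool, swap_prob=0.3, noise_std=0.05, rng=random):
    """Hypothetical hybrid augmentation (an assumption, not the paper's recipe).

    Each feature is replaced, with probability swap_prob, by the same
    feature from a randomly chosen donor row, then perturbed with
    Gaussian noise of standard deviation noise_std.
    """
    donor = rng.choice(pool)
    view = []
    for value, donor_value in zip(row, donor):
        if rng.random() < swap_prob:
            value = donor_value
        view.append(value + rng.gauss(0.0, noise_std))
    return view


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet-margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def nearest_prototype(embedding, prototypes):
    """Return the label whose prototype vector is closest to the embedding."""
    return min(prototypes, key=lambda label: euclidean(embedding, prototypes[label]))


# Toy embeddings: the positive is an augmented view of the anchor,
# the negative is a different sample.
anchor = [0.1, 0.9, 0.0]
positive = [0.1, 0.8, 0.0]
negative = [0.9, 0.1, 1.0]

# The negative is already farther away than the positive by more than
# the margin, so this triplet contributes no loss.
loss_small_margin = triplet_margin_loss(anchor, positive, negative, margin=1.0)

# A larger margin makes the same triplet contribute a positive loss.
loss_large_margin = triplet_margin_loss(anchor, positive, negative, margin=2.0)

# Prototype-based inference with assumed, illustrative prototypes.
prototypes = {"legitimate": [0.1, 0.9, 0.0], "phishing": [0.9, 0.1, 1.0]}
predicted = nearest_prototype([0.2, 0.8, 0.1], prototypes)
```

In this sketch the loss is zero once the negative lies at least `margin` farther from the anchor than the positive does, which is the property that pushes the two classes apart in the embedding space. At inference time no labels are consulted beyond the names attached to the prototypes themselves.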
