PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark

Thomas Dalton, Hemanth Gowda, Girish Rao, Sachin Pargi, Alireza Hadj Khodabakhshi, Joseph Rombs, Stephan Jou, Manish Marwah

TL;DR

PhreshPhish addresses the lack of realistic, large-scale phishing data by providing a large, high-quality phishing webpage dataset collected with a browser-based pipeline, together with a comprehensive suite of leakage-resistant benchmarks. The dataset undergoes a rigorous two-stage cleaning process, and the test/benchmark design incorporates temporal splits, diversity, difficulty, and multiple base rates to reflect real-world conditions. Baseline experiments across linear, FFN, BERT-based (GTE), and LLM models show strong performance at high base rates but substantial degradation as the base rate decreases, underscoring the need for robust evaluation standards. The dataset and benchmarks are publicly available, enabling standardized comparisons and encouraging advances in phishing detection research.

Abstract

Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by a lack of large, high-quality datasets and benchmarks. In addition to poor quality caused by challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and provides significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjusting base rates to values more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (https://huggingface.co/datasets/phreshphish/phreshphish).
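
To illustrate why unrealistic base rates lead to overly optimistic results, the following minimal sketch computes the precision implied by a fixed true-positive rate and false-positive rate at several phishing base rates. The function name and the operating point (TPR 0.95, FPR 0.01) are illustrative assumptions, not values reported in the paper.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision implied by a fixed TPR/FPR at a given phishing base rate.

    precision = TPR * pi / (TPR * pi + FPR * (1 - pi)), where pi is the base rate.
    """
    true_pos = tpr * base_rate            # expected fraction of true positives
    false_pos = fpr * (1.0 - base_rate)   # expected fraction of false positives
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical classifier: TPR and FPR are assumed, not taken from the paper.
    for rate in (0.5, 0.1, 0.01, 0.001):
        p = precision_at_base_rate(tpr=0.95, fpr=0.01, base_rate=rate)
        print(f"base rate {rate}: precision {p:.3f}")
```

With this assumed operating point, the same classifier drops from a precision of roughly 0.99 on a balanced set to below 0.1 at a base rate of 0.001, which is the kind of degradation that a balanced test set hides.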

Paper Structure

This paper contains 25 sections, 1 equation, 14 figures, 5 tables, 3 algorithms.

Figures (14)

  • Figure 1: Our end-to-end pipeline consists of four distinct stages, with the benign-to-phishing count shown at each step: (1) benign and phishing HTML is collected from the web using a real browser to ensure high fidelity; (2) the retrieved HTML is cleaned and assessed for quality using a combination of automated heuristics and human annotation; (3) train and test splits are created temporally and pruned to minimize leakage; (4) benchmark datasets are created by applying a set of diversity, difficulty, and base rate filters.
  • Figure 2: A one-month snapshot (August 2025) of collected phishing pages. (a) Most frequently targeted brands over time. Targeted brands follow a power law-like distribution, with a few brands being targeted much more frequently than others. (b) The most common domains in phishing URLs are often legitimate domains that allow users to upload and host content, such as Vercel and Blogspot.
  • Figure 3: Precision-recall curves on the benchmark datasets.
  • Figure 4: The train and test datasets are temporally disjoint for each class.
  • Figure 5: Multiple failure modes are frequently encountered when scraping phishing pages. (a) DNS resolution issues can prevent the page from being scraped. (b) Security takedown notices can prevent the page from being scraped. (c) URL shortening services will sometimes block access to the intended page. (d) Redirects to the legitimate target brand page can occur, resulting in mislabeled data.
  • ...and 9 more figures
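
Stages (3) and (4) of the pipeline in Figure 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Page` record, the single cutoff date, the hostname-overlap rule used as the leakage criterion, and the subsampling helper are all hypothetical, since the captions above do not specify how pruning or base rate filtering is actually done.

```python
import random
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one collected page."""
    url: str
    label: int       # 1 = phishing, 0 = benign
    collected: date  # date the page was scraped


def temporal_split(pages, cutoff):
    """Put pages collected before the cutoff in train, the rest in test."""
    train = [p for p in pages if p.collected < cutoff]
    test = [p for p in pages if p.collected >= cutoff]
    return train, test


def prune_leakage(train, test):
    """Drop test pages whose hostname already appears in train.

    Hostname overlap is an assumed stand-in for the paper's leakage criterion.
    """
    seen_hosts = {urlsplit(p.url).hostname for p in train}
    return [p for p in test if urlsplit(p.url).hostname not in seen_hosts]


def subsample_to_base_rate(benign, phishing, base_rate, seed=0):
    """Keep all benign pages and sample phishing pages to hit the base rate."""
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must be strictly between 0 and 1")
    # Solve n_phish / (n_phish + n_benign) = base_rate for n_phish.
    wanted = round(base_rate * len(benign) / (1.0 - base_rate))
    wanted = min(wanted, len(phishing))
    sampled = random.Random(seed).sample(phishing, wanted)
    return benign + sampled
```

A typical use would call `temporal_split` with a chosen cutoff, pass the result through `prune_leakage`, and then build one benchmark per target base rate by calling `subsample_to_base_rate` on the benign and phishing portions of the pruned test set.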