Constructing and Benchmarking: a Labeled Email Dataset for Text-Based Phishing and Spam Detection Framework

Rebeka Toth, Tamas Bisztray, Richard Dubniczky

TL;DR

This work builds a comprehensive, extensible labeled email dataset of phishing, spam, and legitimate messages that explicitly distinguishes human- from LLM-generated content and annotates each email's emotional appeal and attacker motivation. It benchmarks multiple LLMs on emotion and motivation labeling, selects Claude 3.5 Sonnet as the annotator, and then evaluates a state-of-the-art model on original and rephrased emails to assess robustness. The study finds strong phishing detection but persistent difficulty distinguishing spam from legitimate emails; paraphrasing has minimal impact on classifier performance, which suggests the classifier could hold up in practical deployment. By releasing open-source code, templates, and data processing pipelines, the paper provides a foundation for AI-assisted email security and reproducible research in threat detection and emotion-aware NLP. The dataset enables future benchmarks, robustness studies, and the development of more resilient defenses against AI-enhanced phishing and spam campaigns.

Abstract

Phishing and spam emails remain a major cybersecurity threat, with attackers increasingly leveraging Large Language Models (LLMs) to craft highly deceptive content. This study presents a comprehensive email dataset containing phishing, spam, and legitimate messages, explicitly distinguishing between human- and LLM-generated content. Each email is annotated with its category, emotional appeal (e.g., urgency, fear, authority), and underlying motivation (e.g., link-following, credential theft, financial fraud). We benchmark multiple LLMs on their ability to identify these emotional and motivational cues and select the most reliable model to annotate the full dataset. To evaluate classification robustness, emails were also rephrased using several LLMs while preserving meaning and intent. A state-of-the-art LLM was then assessed on its performance across both original and rephrased emails using expert-labeled ground truth. The results highlight strong phishing detection capabilities but reveal persistent challenges in distinguishing spam from legitimate emails. Our dataset and evaluation framework contribute to improving AI-assisted email security systems. To support open science, all code, templates, and resources are available on our project site.

Paper Structure

This paper contains 25 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Methodology for dataset building and evaluating.
  • Figure 2: Categorization Prompt