Next-Generation Phishing: How LLM Agents Empower Cyber Attackers

Khalifa Afane, Wenqi Wei, Ying Mao, Junaid Farooq, Juntao Chen

TL;DR

A comprehensive evaluation of traditional phishing detectors, as well as machine learning models like SVM, Logistic Regression, and Naive Bayes, in identifying both traditional and LLM-rephrased phishing emails reveals notable declines in detection accuracy for rephrased emails.

Abstract

Phishing emails are an escalating threat that has grown more sophisticated with the rise of Large Language Models (LLMs). As attackers exploit LLMs to craft more convincing and evasive phishing emails, it is crucial to assess the resilience of current phishing defenses. In this study, we conduct a comprehensive evaluation of traditional phishing detectors, such as Gmail Spam Filter, Apache SpamAssassin, and Proofpoint, as well as machine learning models like SVM, Logistic Regression, and Naive Bayes, in identifying both traditional and LLM-rephrased phishing emails. We also explore the emerging role of LLMs as phishing detection tools, a method already adopted by companies like NTT Security Holdings and JPMorgan Chase. Our results reveal notable declines in detection accuracy for rephrased emails across all detectors, highlighting critical weaknesses in current phishing defenses. As the threat landscape evolves, our findings underscore the need for stronger security controls and regulatory oversight on LLM-generated content to prevent its misuse in creating advanced phishing attacks. This study contributes to the development of more effective Cyber Threat Intelligence (CTI) in two ways: it uses LLMs to generate diverse phishing variants for data augmentation, and it uses LLMs to enhance phishing detection, paving the way for more robust and adaptable threat detection systems.

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Evaluation methodology workflow, highlighting the differences in detection effectiveness on average between traditional and LLM-rephrased emails.
  • Figure 2: Depiction of the decision boundary shift between traditional phishing emails and LLM-rephrased phishing emails in terms of classification probability.
  • Figure 3: Classification Results for 5 Original Emails by Llama 3.
  • Figure 4: Classification Results for 5 Rephrased Emails by Llama 3 (Few-Shot Prompting).
  • Figure 5: Accuracy Comparison of SVM, Naive Bayes, and Logistic Regression in Detecting Rephrased Emails: Traditional vs. LLM-Augmented Datasets.