Assessing AI vs Human-Authored Spear Phishing SMS Attacks: An Empirical Study

Jerson Francia; Derek Hansen; Ben Schooley; Matthew Taylor; Shydra Murray; Greg Snow

TL;DR

The paper addresses the risk of AI-enabled spear phishing by empirically comparing AI-generated versus human-authored SMS messages using the TRAPD framework. It finds that AI-generated content is often as persuasive as, and sometimes more effective than, human-crafted messages, particularly for job-related topics, though differences are not consistently statistically significant due to limited sample size. TRAPD combines threshold-based ranking with qualitative feedback to illuminate why messages deceive and how recipients judge authorship, revealing that detection of AI-generated content remains challenging for users. The results underscore the need for improved countermeasures, including detection tools and targeted training, to mitigate evolving AI-driven social engineering threats in real-world settings.

Abstract

This paper explores the use of Large Language Models (LLMs) in spear phishing message generation and evaluates their performance compared to human-authored counterparts. Our pilot study examines the effectiveness of smishing (SMS phishing) messages created by GPT-4 and human authors, which have been personalized for willing targets. The targets assessed these messages in a modified ranked-order experiment using a novel methodology we call TRAPD (Threshold Ranking Approach for Personalized Deception). Experiments involved ranking each spear phishing message from most to least convincing, providing qualitative feedback, and guessing which messages were human- or AI-generated. Results show that LLM-generated messages are often perceived as more convincing than those authored by humans, particularly job-related messages. Targets also struggled to distinguish between human- and AI-generated messages. We analyze different criteria the targets used to assess the persuasiveness and source of messages. This study aims to highlight the urgent need for further research and improved countermeasures against personalized AI-enabled social engineering attacks.

Paper Structure (52 sections, 7 figures, 4 tables)

Introduction
Review of Related Literature
Evolution of Phishing Techniques
AI-enabled Phishing
Comparison of Human vs AI in Phishing
Research Questions
Methodology
The TRAPD Methodology
Recruiting targets who share personal information
Creating personalized deceptive messages
Human Generation
AI Generation
Target Interview and Sorting Activity
Threshold rank order
Qualitative Assessment
...and 37 more sections

Figures (7)

Figure 1: Overview of the project within the context of the TRAPD Methodology steps. Recruitment, Generation, and Assessment are numbered according to each step in the methodology.
Figure 2: Target Demographics (n=25).
Figure 3: A sample of the threshold rank ordering used in the target interview. After ranking the messages from most (red) to least (blue) likely, a marker (yellow) is placed to indicate the threshold for clicking.
Figure 4: An example of the AI labeling phase of the target interview.
Figure 5: Rank Distribution from each source. On average, AI-generated messages ranked slightly higher than human-generated messages.
...and 2 more figures
