Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation

Fred Heiding, Simon Lermen

TL;DR

This work investigates whether current AI safety guardrails protect against AI-generated phishing aimed at the elderly. It combines an end-to-end guardrail evaluation across six frontier LLMs with a human validation study involving 108 seniors, finding a real-world compromise rate of 11% across 268 emails. The results show substantial model-specific vulnerability and indicate that attackers can bypass safeguards and scale phishing through multi-turn, cross-lingual interactions, underscoring the need for stronger governance and technical countermeasures such as digital identity authentication. The study emphasizes the urgency of industry-wide, standardized safety protocols as AI systems become capable of sophisticated social-engineering campaigns.

Abstract

We present an end-to-end demonstration of how attackers can exploit AI safety failures to harm vulnerable populations: from jailbreaking LLMs to generate phishing content, to deploying those messages against real targets, to successfully compromising elderly victims. We systematically evaluated safety guardrails across six frontier LLMs spanning four attack categories, revealing critical failures where several models exhibited near-complete susceptibility to certain attack vectors. In a human validation study with 108 senior volunteers, AI-generated phishing emails successfully compromised 11% of participants. Our work uniquely demonstrates the complete attack pipeline targeting elderly populations, highlighting that current AI safety measures fail to protect those most vulnerable to fraud. Beyond generating phishing content, LLMs enable attackers to overcome language barriers and conduct multi-turn trust-building conversations at scale, fundamentally transforming fraud economics. While some providers report voluntary counter-abuse efforts, we argue these remain insufficient.

Paper Structure

This paper contains 22 sections, 3 figures.

Figures (3)

  • Figure 1: Example of Meta AI generating phishing content without refusal. The prompt explicitly states malicious intent ("scam users for money"), yet the model complied fully.
  • Figure 2: Attack success rate by model and attack category. Higher success rates indicate lower safety performance.
  • Figure 3: Outcomes of our phishing study grouped by LLM. Error bars represent 95% Wilson confidence intervals (Wilson, 1927).
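
For readers unfamiliar with the Wilson interval named in the Figure 3 caption, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is not the authors' analysis code. The function name `wilson_interval` and the example counts (12 of 108) are assumptions chosen for illustration only and are not taken from the paper's data.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion.

    `successes` is the number of positive outcomes, `n` the number of trials.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n

    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom

    # Clamp to [0, 1] to guard against floating-point drift at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not the paper's data).
    low, high = wilson_interval(12, 108)
    print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation (Wald) interval, the Wilson interval stays within [0, 1] and behaves reasonably for small samples and proportions near 0, which is one reason it is a common choice for per-group rates such as those in a per-model breakdown.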