Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance

Chen-Wei Chang, Shailik Sarkar, Shutonu Mitra, Qi Zhang, Hossein Salemi, Hemant Purohit, Fengxiu Zhang, Michin Hong, Jin-Hee Cho, Chang-Tien Lu

TL;DR

This work investigates the vulnerability of LLM-based scam detectors to adversarial scam messages. It builds a dataset with fine-grained labels, including original and adversarial examples generated via GPT-4 prompts, and benchmarks GPT-3.5 Turbo, Claude 3 Haiku, and LLaMA 3.1 8B Instruct under zero-shot and few-shot settings across Romance, Finance, and Recruitment scams. Results show that adversarial modifications significantly reduce accuracy, with larger models like GPT-3.5 Turbo exhibiting greater robustness than smaller ones, while Romance scams are the most susceptible category. The authors propose mitigation strategies such as adversarial prompting and targeted training to enhance robustness, underscoring the need for ongoing defenses in security-critical NLP applications.

Abstract

Can we trust Large Language Models (LLMs) to accurately detect scams? This paper investigates the vulnerabilities of LLMs when facing adversarial scam messages in the scam detection task. We addressed this issue by creating a comprehensive dataset of scam messages with fine-grained labels, including both original and adversarial scam messages. The dataset extended the traditional binary classes of the scam detection task into more nuanced scam types. Our analysis showed how adversarial examples exploited the vulnerabilities of an LLM, leading to a high misclassification rate. We evaluated the performance of LLMs on these adversarial scam messages and proposed strategies to improve their robustness.
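As a rough illustration of the comparison described above, the following sketch computes a misclassification rate per scam type, separately for original and adversarial messages. The field names (`scam_type`, `variant`, `label`, `predicted`) and the toy records are assumptions made up for this example; they are not the authors' evaluation code, schema, or data.

```python
from collections import defaultdict


def misclassification_rates(records):
    """Return {(scam_type, variant): fraction of messages the detector got wrong}.

    Each record is a dict with hypothetical keys:
      scam_type - e.g. "romance", "finance", "recruitment"
      variant   - "original" or "adversarial"
      label     - ground-truth class
      predicted - class returned by the detector
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = (record["scam_type"], record["variant"])
        totals[key] += 1
        if record["predicted"] != record["label"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Made-up toy records purely for illustration; not results from the paper.
toy_records = [
    {"scam_type": "romance", "variant": "original", "label": "scam", "predicted": "scam"},
    {"scam_type": "romance", "variant": "original", "label": "scam", "predicted": "scam"},
    {"scam_type": "romance", "variant": "adversarial", "label": "scam", "predicted": "non-scam"},
    {"scam_type": "romance", "variant": "adversarial", "label": "scam", "predicted": "scam"},
]

rates = misclassification_rates(toy_records)
for (scam_type, variant), rate in sorted(rates.items()):
    print(f"{scam_type:12s} {variant:12s} misclassification rate = {rate:.2f}")
```

Grouping by both scam type and variant keeps the original and adversarial rates side by side, which is the kind of gap the abstract refers to when it reports a high misclassification rate on adversarial messages.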

Paper Structure

This paper contains 10 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Experimental procedures, including data collection, annotation, LLM testing, and result analysis.