Evaluating the Performance of ChatGPT for Spam Email Detection

Shijing Si, Yuwei Wu, Le Tang, Yugui Zhang, Jedrek Wosik, Qinliang Su

TL;DR

This study evaluates ChatGPT for spam email detection in English and Chinese datasets using in-context learning with zero- and few-shot prompts, comparing against Naive Bayes, SVM, LR, DNN, and BERT. It finds that ChatGPT underperforms relative to deep supervised models on a large English dataset but excels in a low-resource Chinese setting, particularly with five-shot prompts. The results underscore the potential and limits of LLM-based spam detection, especially for resource-constrained languages, and emphasize prompt design as a critical driver of performance. Overall, the work provides practical insights into deploying ChatGPT for cybersecurity tasks where language resources are unevenly distributed and highlights directions for future improvement with larger models and refined prompts.

Abstract

Email remains a pivotal and widely used communication medium in professional and commercial domains. However, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advances in natural language processing, particularly large language models such as ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, the potential of ChatGPT for spam identification remains underexplored. To fill this gap, this study evaluates ChatGPT's capabilities for spam identification on both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction with or without a few demonstrations, and we investigate how the number of demonstrations in the prompt affects its performance. For comparison, we implement five popular benchmark methods: naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Extensive experiments show that ChatGPT performs significantly worse than deep supervised learning methods on the large English dataset, while it achieves superior performance on the low-resource Chinese dataset. This study provides insights into the potential and limitations of ChatGPT for spam identification, highlighting its promise as a viable solution for resource-constrained language domains.
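
The in-context learning setup described in the abstract (an instruction plus zero or more labeled demonstrations, followed by the email to classify) can be sketched as below. This is only an illustration under stated assumptions: the instruction wording, the `spam`/`ham` label names, and the helpers `build_prompt` and `parse_label` are hypothetical and are not taken from the paper. No model call is made; the sketch only covers assembling the prompt and reading back a reply.

```python
from typing import Sequence, Tuple


def build_prompt(email: str, demos: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a zero-shot (no demos) or few-shot classification prompt."""
    # Hypothetical instruction text; the paper's actual prompt may differ.
    parts = ["Classify the following email as 'spam' or 'ham'. Answer with one word."]
    # Each demonstration is an (email text, label) pair shown to the model.
    for text, label in demos:
        parts.append(f"Email: {text}\nLabel: {label}")
    # The query email comes last, with the label left open for the model.
    parts.append(f"Email: {email}\nLabel:")
    return "\n\n".join(parts)


def parse_label(reply: str) -> str:
    """Map a free-form reply to 'spam' or 'ham' using its first word.

    Anything that does not start with 'spam' falls back to 'ham'
    (an assumption made for this sketch, not a rule from the paper).
    """
    words = reply.strip().lower().split()
    first = words[0].strip(".,!:'\"") if words else ""
    return "spam" if first == "spam" else "ham"
```

Calling `build_prompt(email)` with no demonstrations gives the zero-shot case; passing five labeled pairs gives a five-shot prompt of the kind the TL;DR mentions.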

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of performance of different methods on the English ESD dataset. The four evaluation metrics are macro-level precision, recall, F1 score and accuracy.
  • Figure 2: Comparison of performance of different methods on the low-resourced Chinese CSD dataset. The four evaluation metrics are macro-level precision, recall, F1 score and accuracy.
  • Figure 3: The performance of ChatGPT versus the number of instances in prompts on the English ESD dataset.
  • Figure 4: The performance of ChatGPT versus the number of instances in prompts on the Chinese CSD dataset.
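
The figure captions refer to macro-level precision, recall, F1 score, and accuracy. A minimal sketch of how such macro-averaged metrics are commonly computed follows. It assumes macro F1 is the unweighted mean of per-class F1 scores, which is one common convention; the paper's exact definition may differ. The function name `macro_metrics` and the label strings are illustrative.

```python
from typing import Dict, List


def macro_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute macro precision, recall, F1 and overall accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Per-class counts, treating class c as the positive class.
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        # Macro averaging: every class counts equally, regardless of size.
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
    }
```

Because each class contributes equally to the macro averages, a classifier that neglects the minority class is penalized even when its overall accuracy looks high.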