NLP-ADBench: NLP Anomaly Detection Benchmark

Yuangang Li, Jiaqi Li, Zhuo Xiao, Tiankai Yang, Yi Nian, Xiyang Hu, Yue Zhao

TL;DR

NLP-ADBench tackles the challenge of NLP anomaly detection by delivering a comprehensive benchmark that spans eight real-world NLP datasets and 19 algorithms, including end-to-end and two-step approaches using transformer embeddings. The findings show that no single model dominates, but two-step methods that leverage transformer-based embeddings, especially those from OpenAI, often outperform end-to-end systems. Dataset characteristics heavily influence results, and higher-dimensional embeddings incur greater computational costs. The work highlights the need for automated model selection, embedding integration for end-to-end methods, and NLP-AD-specific dimensionality reduction, while providing an open-source framework to drive future research and practical adoption. Overall, NLP-ADBench establishes a rigorous standard for evaluating NLP anomaly detection and informs practical deployment decisions in fraud detection, content moderation, and spam analysis.

Abstract

Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
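
To make the two-step idea in the abstract concrete, the following is a minimal sketch, not the benchmark's implementation. Step one maps each text to a vector; step two scores test vectors with a classical detector fitted on normal data. The function `toy_embed` is a hypothetical stand-in (a hashed bag of words) for the BERT or OpenAI embeddings used in the paper, and `knn_scores` is a simple k-nearest-neighbor distance score chosen here for illustration. Both names and the parameters `dim` and `k` are assumptions.

```python
import hashlib
import math


def toy_embed(text, dim=16):
    """Hypothetical stand-in for a transformer embedding (assumption):
    a hashed bag of words, L2-normalized to unit length."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def knn_scores(train_vecs, test_vecs, k=2):
    """Step two: anomaly score = Euclidean distance to the k-th nearest
    training vector. Larger scores suggest more anomalous inputs."""
    scores = []
    for query in test_vecs:
        dists = sorted(math.dist(query, t) for t in train_vecs)
        scores.append(dists[min(k, len(dists)) - 1])
    return scores


# Step one: embed normal training texts and unseen test texts.
train_texts = ["meeting moved to friday", "lunch at noon tomorrow",
               "please review the attached report"]
test_texts = ["lunch at noon tomorrow", "claim your free prize now"]
train_vecs = [toy_embed(t) for t in train_texts]
test_vecs = [toy_embed(t) for t in test_texts]

# Step two: score the test texts against the normal training set.
scores = knn_scores(train_vecs, test_vecs, k=1)
```

Separating the two steps is what lets any embedding model be paired with any vector-based detector, which is the design the 16 two-step approaches rely on.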

Paper Structure

This paper contains 20 sections, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Average rank on AUROC of 19 NLP-AD methods across 8 datasets (lower is better).
  • Figure A1: Average rank on AUPRC of 19 NLP-AD methods across 8 datasets (lower is better).
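
The average-rank metric named in the figure captions can be sketched as follows. On each dataset, methods are ranked by their score (rank 1 is the best), and each method's ranks are then averaged over datasets. This is an illustrative sketch only: the function name `average_ranks`, the method and dataset labels, and the numbers are placeholders rather than results from the paper, and giving tied methods the mean of their positions is an assumption about tie handling.

```python
def average_ranks(metric_by_dataset):
    """Map {dataset: {method: score}} to {method: mean rank}.

    Higher scores are better, rank 1 is best, and tied methods share
    the mean of the positions they occupy (an assumption)."""
    ranks = {}
    for scores in metric_by_dataset.values():
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of methods tied with position i.
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
            for method, _ in ordered[i:j + 1]:
                ranks.setdefault(method, []).append(tied_rank)
            i = j + 1
    return {m: sum(r) / len(r) for m, r in ranks.items()}


# Placeholder AUROC values for illustration only, not taken from the paper.
example = {
    "dataset_1": {"method_a": 0.9, "method_b": 0.8, "method_c": 0.7},
    "dataset_2": {"method_a": 0.6, "method_b": 0.95, "method_c": 0.6},
}
avg = average_ranks(example)
```

In this toy input, `method_b` ranks second and first, so its average rank of 1.5 is the lowest, which is the sense in which a lower average rank is better.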