Deep Anomaly Detection in Text
Andrei Manolache
TL;DR
The thesis proposes DATE, an end-to-end, transformer-based method for anomaly detection in text that uses self-supervised pretext tasks to produce an anomaly score. DATE trains on two pretext tasks, Replaced Mask Detection and Replaced Token Detection, within a generator–discriminator framework inspired by ELECTRA, and introduces a computationally efficient Pseudo Label score for inference. On 20Newsgroups and AG News, DATE achieves state-of-the-art results in both the semi-supervised and unsupervised settings, outperforming classical baselines (OC-SVM, CVDD, SVDD) and deep competitors (E3 Outlier variants). The method also offers per-token anomaly explanations, and the thesis points to future work on self-supervised objectives and contrastive learning for textual anomaly detection, with potential extensions to authorship and stylistic analysis.
Abstract
Deep anomaly detection methods have become increasingly popular in recent years, with methods such as Stacked Autoencoders, Variational Autoencoders, and Generative Adversarial Networks greatly improving the state of the art. Other methods augment classical models (such as the One-Class Support Vector Machine) by learning an appropriate kernel function with neural networks. Recent developments in self-supervised representation learning are proving very beneficial for anomaly detection. Inspired by advances in self-supervised anomaly detection in computer vision, this thesis develops a method for detecting anomalies by exploiting pretext tasks tailored to text corpora. The approach greatly improves the state of the art on two datasets, 20Newsgroups and AG News, for both semi-supervised and unsupervised anomaly detection, demonstrating the potential of self-supervised anomaly detectors in natural language processing.
