NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation

Hao Dong; Gaëtan Frusque; Yue Zhao; Eleni Chatzi; Olga Fink

TL;DR

Nearest Neighbor Gaussian Mixup (NNG-Mix) is a novel algorithm that generates additional pseudo-anomalies from a limited set of labeled anomalies and a large volume of unlabeled data, efficiently integrating information from both.

Abstract

Anomaly detection (AD) is essential in identifying rare and often critical events in complex systems, finding applications in fields such as network intrusion detection, financial fraud detection, and fault detection in infrastructure and industrial systems. While AD is typically treated as an unsupervised learning task due to the high cost of label annotation, it is more practical to assume access to a small set of labeled anomaly samples from domain experts, as is the case for semi-supervised anomaly detection. Semi-supervised and supervised approaches can leverage such labeled data, resulting in improved performance. In this paper, rather than proposing a new semi-supervised or supervised approach for AD, we introduce a novel algorithm for generating additional pseudo-anomalies on the basis of the limited labeled anomalies and a large volume of unlabeled data. This serves as an augmentation to facilitate the detection of new anomalies. Our proposed algorithm, named Nearest Neighbor Gaussian Mixup (NNG-Mix), efficiently integrates information from both labeled and unlabeled data to generate pseudo-anomalies. We compare the performance of this novel algorithm with commonly applied augmentation techniques, such as Mixup and Cutout. We evaluate NNG-Mix by training various existing semi-supervised and supervised anomaly detection algorithms on the original training data along with the generated pseudo-anomalies. Through extensive experiments on 57 benchmark datasets in ADBench, reflecting different data types, we demonstrate that NNG-Mix outperforms other data augmentation methods. It yields significant performance improvements compared to the baselines trained exclusively on the original training data. Notably, NNG-Mix yields up to 16.4%, 8.8%, and 8.0% improvements on Classical, CV, and NLP datasets in ADBench. Our source code is available at https://github.com/donghao51/NNG-Mix.
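
The abstract names the ingredients of NNG-Mix (nearest neighbors, Gaussian noise, Mixup) but this page does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation (which is available at the repository linked above). The function name `nng_mix_sketch`, the parameters `k`, `sigma`, and `alpha`, and the Beta-distributed mixing weight are all assumptions made for illustration.

```python
import numpy as np

def nng_mix_sketch(x_anom, x_unlab, n_new, k=5, sigma=0.01, alpha=0.2, seed=0):
    """Hypothetical sketch: mix a sampled labeled anomaly with one of its
    k nearest neighbors (drawn from labeled + unlabeled data), after adding
    Gaussian noise to both points. All defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # Candidate neighbors: labeled anomalies first, then unlabeled samples.
    pool = np.vstack([x_anom, x_unlab])
    out = np.empty((n_new, x_anom.shape[1]))
    for i in range(n_new):
        a_idx = rng.integers(len(x_anom))
        anchor = x_anom[a_idx]
        dist = np.linalg.norm(pool - anchor, axis=1)
        dist[a_idx] = np.inf  # do not pair the anchor with itself
        neighbors = np.argsort(dist)[:k]
        partner = pool[rng.choice(neighbors)]
        # Gaussian perturbation of both points before mixing.
        anchor_noisy = anchor + rng.normal(0.0, sigma, size=anchor.shape)
        partner_noisy = partner + rng.normal(0.0, sigma, size=partner.shape)
        # Mixup-style convex combination with a Beta-distributed weight.
        lam = rng.beta(alpha, alpha)
        out[i] = lam * anchor_noisy + (1.0 - lam) * partner_noisy
    return out

# Toy data: unlabeled points around the origin, a few labeled anomalies far away.
rng = np.random.default_rng(1)
x_unlab = rng.normal(0.0, 1.0, size=(200, 2))
x_anom = rng.normal(6.0, 0.5, size=(10, 2))
pseudo = nng_mix_sketch(x_anom, x_unlab, n_new=50)
```

In a setup like the one the abstract describes, the rows of `pseudo` would be appended to the training set with the anomaly label before fitting a semi-supervised or supervised detector. Restricting the mixing partner to near neighbors of a labeled anomaly is the design choice that distinguishes this sketch from plain Mixup over all training data.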

Paper Structure (22 sections, 5 equations, 9 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Methodology
Preliminaries and Problem Definition
Baseline Pseudo-anomaly Generation Methods
Nearest Neighbor Gaussian Mixup
Experiments
Experimental Setup
Results
Ablations
Ablations on Each Module in NNG-Mix
Ablations on Different Numbers of Generated Pseudo-anomalies
Ablations on Different Quantities of Labeled Anomalies
Ablations on Different Pollution Ratios
Combination of Different Pseudo-Anomaly Generation Algorithms
...and 7 more sections

Figures (9)

Figure 1: Pseudo-anomaly generation using Mixup (Zhang et al., 2018) and our proposed NNG-Mix. Mixup using all training data introduces noisy samples within the distribution of unlabeled data, while Mixup using only labeled anomalies underuses the information in the unlabeled data and still injects some noisy samples that fall within the unlabeled data. In contrast, our NNG-Mix fully exploits the information from both labeled anomalies and unlabeled data samples to generate pseudo-anomalies effectively.
Figure 2: Nearest Neighbor Gaussian Mixup makes good use of information from both labeled anomalies and unlabeled data to generate pseudo-anomalies effectively.
Figure 3: Ablations on different numbers of generated pseudo-anomalies and labeled anomalies. Generating more pseudo-anomalies brings performance improvement in general but tends to plateau when $M$ exceeds $10$. With the availability of more labeled anomalies, the performances of all algorithms are improved significantly.
Figure 4: AUCROC with different pollution ratios. Lower pollution ratios consistently lead to performance improvements across all algorithms. Notably, for DeepSAD, MLP, and FTTransformer, NNG-Mix substantially enhances the Baseline setups, regardless of the pollution ratio. In contrast, for XGBOD and CatB, when the pollution ratio is low, the Baselines already exhibit superior performance, and the introduction of additional pseudo-anomalies does not yield significant benefits. NNG-Mix demonstrates its efficacy primarily when the pollution ratio surpasses the $0.6$ threshold.
Figure 5: Parameter Sensitivity. NNG-Mix demonstrates robustness across different parameter settings, displaying fluctuations of less than $2\%$ in most cases.
...and 4 more figures
