
NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation

Hao Dong, Gaëtan Frusque, Yue Zhao, Eleni Chatzi, Olga Fink

TL;DR

NNG-Mix (Nearest Neighbor Gaussian Mixup) is a novel algorithm that generates additional pseudo-anomalies from a small set of labeled anomalies and a large volume of unlabeled data, efficiently integrating information from both sources.

Abstract

Anomaly detection (AD) is essential in identifying rare and often critical events in complex systems, finding applications in fields such as network intrusion detection, financial fraud detection, and fault detection in infrastructure and industrial systems. While AD is typically treated as an unsupervised learning task due to the high cost of label annotation, it is more practical to assume access to a small set of labeled anomaly samples from domain experts, as is the case for semi-supervised anomaly detection. Semi-supervised and supervised approaches can leverage such labeled data, resulting in improved performance. In this paper, rather than proposing a new semi-supervised or supervised approach for AD, we introduce a novel algorithm for generating additional pseudo-anomalies on the basis of the limited labeled anomalies and a large volume of unlabeled data. This serves as an augmentation to facilitate the detection of new anomalies. Our proposed algorithm, named Nearest Neighbor Gaussian Mixup (NNG-Mix), efficiently integrates information from both labeled and unlabeled data to generate pseudo-anomalies. We compare the performance of this novel algorithm with commonly applied augmentation techniques, such as Mixup and Cutout. We evaluate NNG-Mix by training various existing semi-supervised and supervised anomaly detection algorithms on the original training data along with the generated pseudo-anomalies. Through extensive experiments on 57 benchmark datasets in ADBench, reflecting different data types, we demonstrate that NNG-Mix outperforms other data augmentation methods. It yields significant performance improvements compared to the baselines trained exclusively on the original training data. Notably, NNG-Mix yields up to 16.4%, 8.8%, and 8.0% improvements on Classical, CV, and NLP datasets in ADBench. Our source code is available at https://github.com/donghao51/NNG-Mix.
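
The abstract names the ingredients of NNG-Mix (nearest neighbors, Gaussian noise, Mixup) but does not spell out the steps. The following is a minimal sketch of one plausible reading, not the authors' implementation, which is available at the repository linked above. The function name `nng_mix_sketch` and the parameters `k`, `alpha`, and `noise_std` are assumptions introduced for illustration: each pseudo-anomaly is obtained by picking a labeled anomaly, choosing one of its nearest neighbors from the combined labeled and unlabeled pool, perturbing both with Gaussian noise, and interpolating them with a Beta-distributed Mixup weight.

```python
import numpy as np


def nng_mix_sketch(anomalies, unlabeled, n_pseudo, k=5, alpha=0.2,
                   noise_std=0.01, seed=0):
    """Illustrative pseudo-anomaly generator (hypothetical, not the paper's code).

    anomalies: (n_a, d) array of labeled anomalies
    unlabeled: (n_u, d) array of unlabeled samples
    Returns an (n_pseudo, d) array of generated pseudo-anomalies.
    """
    rng = np.random.default_rng(seed)
    # Candidate neighbors come from both labeled anomalies and unlabeled data.
    pool = np.vstack([anomalies, unlabeled])
    k = min(k, len(pool))
    out = np.empty((n_pseudo, anomalies.shape[1]))
    for i in range(n_pseudo):
        # Anchor: a randomly chosen labeled anomaly.
        anchor = anomalies[rng.integers(len(anomalies))]
        # k nearest neighbors of the anchor in the pool (Euclidean distance).
        # The anchor itself is in the pool, so it can be its own neighbor.
        dists = np.linalg.norm(pool - anchor, axis=1)
        neighbor_idx = np.argsort(dists)[:k]
        partner = pool[rng.choice(neighbor_idx)]
        # Gaussian perturbation of both samples (assumed noise scale).
        a = anchor + rng.normal(0.0, noise_std, size=anchor.shape)
        b = partner + rng.normal(0.0, noise_std, size=partner.shape)
        # Mixup: convex combination with a Beta(alpha, alpha) weight.
        lam = rng.beta(alpha, alpha)
        out[i] = lam * a + (1.0 - lam) * b
    return out
```

Under this reading, the generated samples would be appended to the training set with the anomaly label before fitting a semi-supervised or supervised detector, which matches the evaluation setup described in the abstract. Restricting the Mixup partner to near neighbors of a labeled anomaly is what keeps the interpolated points away from the bulk of the unlabeled data in this sketch.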

Paper Structure (22 sections, 5 equations, 9 figures, 8 tables, 1 algorithm)

Figures (9)

  • Figure 1: Pseudo-anomaly generation using Mixup (Zhang et al., 2018) and our proposed NNG-Mix. Mixup using all training data introduces noise samples within the distribution of unlabeled data, while using only labeled anomalies underestimates the information from unlabeled data and also injects some noise samples that are within unlabeled data. In contrast, our NNG-Mix fully exploits the information from both labeled anomalies and unlabeled data samples to generate pseudo-anomalies effectively.
  • Figure 2: Nearest Neighbor Gaussian Mixup makes good use of information from both labeled anomalies and unlabeled data to generate pseudo-anomalies effectively.
  • Figure 3: Ablations on different numbers of generated pseudo-anomalies and labeled anomalies. Generating more pseudo-anomalies generally improves performance, but the gains tend to plateau when $M$ exceeds $10$. With more labeled anomalies available, the performance of all algorithms improves significantly.
  • Figure 4: AUCROC with different pollution ratios. Lower pollution ratios consistently lead to performance improvements across all algorithms. Notably, for DeepSAD, MLP, and FTTransformer, NNG-Mix substantially improves on the Baseline setups regardless of the pollution ratio. In contrast, for XGBOD and CatB, the Baselines already exhibit strong performance when the pollution ratio is low, and adding pseudo-anomalies does not yield significant benefits; for these two algorithms, NNG-Mix demonstrates its efficacy primarily when the pollution ratio exceeds $0.6$.
  • Figure 5: Parameter Sensitivity. NNG-Mix demonstrates robustness across different parameter settings, displaying fluctuations of less than $2\%$ in most cases.
  • ...and 4 more figures