Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation

Wei Dai, Kai Hwang, Jicong Fan

TL;DR

This work tackles unsupervised anomaly detection for tabular data with a noise-evaluation framework that learns a one-class boundary without using real anomalies. A neural network $h_{\boldsymbol{\theta}}$ is trained on clean data and on noisy variants of it to predict per-feature noise magnitudes, and an aggregation function $g(\cdot)$ turns that prediction into a scalar anomaly score $\text{score}(\boldsymbol{x})=g(h_{\boldsymbol{\theta}}(\boldsymbol{x}))$. The authors provide theoretical guarantees (theorems covering hard and easy anomalies) and report state-of-the-art performance on 47 UAD and 25 OCC tabular datasets, using noise types such as Gaussian, Rayleigh, and Uniform to generate diverse perturbations. The approach is described as lightweight to train, scalable, and applicable across domains, and it requires no real anomalous samples during training.
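The data-generation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `sample_noise` and `make_training_pairs`, the choice of one random noise type per sample, the scaling of the Uniform and Rayleigh noise, and the random sign applied to Rayleigh draws are all assumptions made here.

```python
import math
import random


def sample_noise(kind, sigma, dim, rng):
    """Draw one noise vector of length `dim` (scales are illustrative assumptions)."""
    if kind == "gaussian":
        return [rng.gauss(0.0, sigma) for _ in range(dim)]
    if kind == "uniform":
        # Uniform on [-a, a] has standard deviation a / sqrt(3).
        a = sigma * math.sqrt(3.0)
        return [rng.uniform(-a, a) for _ in range(dim)]
    if kind == "rayleigh":
        # Inverse-transform sampling of a Rayleigh magnitude with scale sigma,
        # plus a random sign (an assumption, used to keep the noise zero-mean).
        out = []
        for _ in range(dim):
            r = sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
            out.append(r if rng.random() < 0.5 else -r)
        return out
    raise ValueError(f"unknown noise kind: {kind}")


def make_training_pairs(X, sigma, rng, kinds=("gaussian", "rayleigh", "uniform")):
    """Return (input, target) pairs: zeros for clean rows, |eps| for noised rows."""
    pairs = []
    for x in X:
        pairs.append((list(x), [0.0] * len(x)))         # clean sample -> zero target
        eps = sample_noise(rng.choice(kinds), sigma, len(x), rng)
        x_hat = [xi + ei for xi, ei in zip(x, eps)]     # noised sample x + eps
        pairs.append((x_hat, [abs(ei) for ei in eps]))  # target is the noise magnitude
    return pairs


rng = random.Random(0)
X = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # toy "clean" rows, invented for illustration
pairs = make_training_pairs(X, sigma=1.0, rng=rng)
```

A regression model fitted to such pairs would play the role of $h_{\boldsymbol{\theta}}$; the model itself is left out of the sketch.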

Abstract

Unsupervised anomaly detection (UAD) plays an important role in modern data analytics, and it is crucial to provide simple yet effective and guaranteed UAD algorithms for real applications. In this paper, we present a novel UAD method for tabular data by evaluating how much noise is in the data. Specifically, we propose to learn a deep neural network from the clean (normal) training dataset and a noisy dataset, where the latter is generated by adding highly diverse noise to the clean data. The neural network can learn a reliable decision boundary between normal data and anomalous data when the diversity of the generated noisy data is sufficiently high so that the hard abnormal samples lie in the noisy region. Importantly, we provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully, although the method does not utilize any real anomalous data in the training stage. Extensive experiments on more than 60 benchmark datasets demonstrate the effectiveness of the proposed method in comparison to 12 UAD baselines. Our method obtains a 92.27% AUC score and a 1.68 ranking score on average. Moreover, compared to the state-of-the-art UAD methods, our method is easier to implement.
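For readers unfamiliar with an average ranking score, one common way to compute it is to rank all methods on each dataset (1 = best AUC) and then average each method's rank over the datasets. The sketch below assumes that definition and assumes tied methods share the average of their positions; the abstract does not spell out either detail. The method names and AUC values are toy numbers invented for illustration, not results from the paper.

```python
def ranks_for_dataset(auc_by_method):
    """Rank methods on one dataset: 1 = highest AUC; ties share the average rank."""
    ordered = sorted(auc_by_method.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_rank
        i = j + 1
    return ranks


def mean_rank(results):
    """Average each method's rank over all datasets."""
    collected = {}
    for auc_by_method in results.values():
        for method, r in ranks_for_dataset(auc_by_method).items():
            collected.setdefault(method, []).append(r)
    return {m: sum(rs) / len(rs) for m, rs in collected.items()}


# Toy AUC values, invented purely for illustration (not results from the paper).
toy_results = {
    "dataset_a": {"ours": 0.95, "baseline_1": 0.90, "baseline_2": 0.85},
    "dataset_b": {"ours": 0.80, "baseline_1": 0.88, "baseline_2": 0.80},
}
avg = mean_rank(toy_results)
```

Under this definition a score close to 1 means a method is at or near the top on most datasets.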

Paper Structure

This paper contains 49 sections, 5 theorems, 32 equations, 11 figures, 11 tables, 2 algorithms.

Key Result

Proposition 1

Adding random noises independently to the entries of $\mathcal{X}$ makes the data more disordered (higher entropy).
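
As a worked special case, assuming differential entropy as the measure of disorder and a single Gaussian feature (both are assumptions for illustration; the paper's own setting and proof may differ), let $X \sim \mathcal{N}(0, s^2)$ and let the independent noise be $\epsilon \sim \mathcal{N}(0, \sigma^2)$ with $\sigma > 0$:

$$
h(X) = \tfrac{1}{2}\log\!\left(2\pi e\, s^2\right), \qquad
h(X+\epsilon) = \tfrac{1}{2}\log\!\left(2\pi e\,(s^2+\sigma^2)\right) > h(X).
$$

The same direction holds for any independent noise with a density, because conditioning cannot increase differential entropy: $h(X+\epsilon) \ge h(X+\epsilon \mid \epsilon) = h(X)$.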

Figures (11)

  • Figure 1: An illustration of the allocation of normal, noised, and true anomalous samples. $\mathcal{D}$, $\hat{\mathcal{D}}$, and $\tilde{\mathcal{D}}$ are the normal, noised, and anomalous distributions respectively. $\tilde{\mathcal{D}}$ is composed of a hard part $\tilde{\mathcal{D}}_H$ and an easy part $\tilde{\mathcal{D}}_E$. Theorem \ref{theorem: gap} and Theorem \ref{theorem_dmin} are for $\tilde{\mathcal{D}}_H$ and $\tilde{\mathcal{D}}_E$ respectively.
  • Figure 2: The training process of the noise evaluation model. Zero-mean noise with standard deviation $\sigma$ is added to the original data $\boldsymbol{x}$ to create noised versions $\hat{\boldsymbol{x}}= \boldsymbol{x}+\boldsymbol{\epsilon}$. The model $h_{\boldsymbol{\theta}}$ is trained to output the zero vector for the original data and the noise magnitude $|\boldsymbol{\epsilon}|$ for the noised data. The final anomaly decision is made using an aggregation function $g(\cdot)$, where a high predicted noise magnitude indicates abnormality.
  • Figure 3: Comparison of different $g(\cdot)$, i.e. mean, maximum, and minimum, on KDD-CUP99, at each optimization epoch.
  • Figure 4: AUC (%) and F1 (%) scores of the proposed method compared with 11 baselines on 47 benchmark datasets. Each experiment is repeated 10 times with random seeds 0 to 9, and the mean and 95% confidence interval are reported. Rank (the lower the better) is calculated out of 12 tested methods.
  • Figure 5: Sensitivity to different noise levels in $[0.1, 0.2, 0.5, 0.8, 1.0, 2.0, 3.0, 5.0]$. The mean rank (the lower the better) is calculated out of 8 noise levels.
  • ...and 6 more figures
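
The three aggregation choices compared in Figure 3 (mean, maximum, minimum) differ only in how they collapse the per-feature prediction into one score. The sketch below is a hedged illustration of that step alone: `AGGREGATORS`, `anomaly_score`, and the two prediction vectors are hypothetical names and values introduced here, standing in for outputs of $h_{\boldsymbol{\theta}}$.

```python
AGGREGATORS = {
    "mean": lambda v: sum(v) / len(v),
    "max": max,
    "min": min,
}


def anomaly_score(predicted_noise, g="mean"):
    """Collapse a per-feature noise-magnitude vector into one scalar score."""
    return AGGREGATORS[g](predicted_noise)


# Hypothetical model outputs for two samples (values invented for illustration).
near_clean = [0.01, 0.02, 0.00, 0.01]
one_bad_feature = [0.01, 0.02, 1.50, 0.01]

# For each aggregator: (score of near_clean, score of one_bad_feature).
scores = {
    g: (anomaly_score(near_clean, g), anomaly_score(one_bad_feature, g))
    for g in AGGREGATORS
}
```

On these toy vectors the maximum reacts strongly to a single corrupted feature, the mean dilutes it across features, and the minimum barely moves; which choice works best on real data is the empirical question the figure addresses.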

Theorems & Definitions (14)

  • Proposition 1
  • Claim 1
  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • proof
  • Definition 2
  • Lemma 1
  • proof
  • ...and 4 more