EntropyStop: Unsupervised Deep Outlier Detection with Loss Entropy

Yihong Huang, Yuang Zhang, Liping Wang, Fan Zhang, Xuemin Lin

TL;DR

The paper tackles unsupervised outlier detection on contaminated data where labels are unavailable. It introduces Loss Entropy $H_L$, a zero-label internal metric derived from the loss distribution, and the EntropyStop automated early stopping algorithm to halt training before performance degrades due to outliers. The authors prove a negative correlation between $H_L$ and AUC under the inlier-priority assumption and validate the approach on 47 real datasets, showing AutoEncoder-based models with EntropyStop outperform ensembles while dramatically reducing training time. The method generalizes to other deep UOD models and is open-sourced, offering a practical, scalable tool for robust, efficient anomaly detection in real-world, noisy data settings.

Abstract

Unsupervised Outlier Detection (UOD) is an important data mining task. With the advance of deep learning, deep Outlier Detection (OD) has received broad interest. Most deep UOD models are trained exclusively on clean datasets to learn the distribution of the normal data, which requires substantial manual effort to clean real-world data, if cleaning is possible at all. Instead of relying on clean datasets, some approaches train and detect directly on unlabeled contaminated datasets, which creates a need for methods that are robust to such conditions. Ensemble methods emerged as a superior solution to enhance model robustness against contaminated training sets. However, the ensemble greatly increases training time. In this study, we investigate the impact of outliers on the training phase, aiming to halt training on unlabeled contaminated datasets before performance degrades. We first observe that mixing normal and anomalous data causes fluctuations in AUC, a label-dependent measure of detection accuracy. To circumvent the need for labels, we propose a zero-label entropy metric named Loss Entropy, computed on the loss distribution, which lets us infer optimal stopping points for training without labels. We also demonstrate theoretically a negative correlation between the entropy metric and the label-based AUC. Based on this, we develop an automated early-stopping algorithm, EntropyStop, which halts training when loss entropy suggests the maximum model detection capability. We conduct extensive experiments on ADBench (including 47 real datasets), and the overall results indicate that an AutoEncoder (AE) enhanced by our approach not only achieves better performance than ensemble AEs but also requires under 2% of the training time. Lastly, our proposed metric and early-stopping approach are evaluated on other deep OD models, exhibiting their broad potential applicability.
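To make the idea of a label-free stopping signal concrete, here is a minimal sketch. It assumes Loss Entropy is the Shannon entropy of the per-sample losses after normalizing them to sum to one (the normalized values are the $u_i$ mentioned in the Figure 4 caption). The function names, the `eps` guard, and the simple patience rule are illustrative assumptions, not the authors' EntropyStop implementation, whose exact stopping criterion is not given in this summary.

```python
import numpy as np

def loss_entropy(losses, eps=1e-12):
    """Shannon entropy of per-sample losses normalized to sum to 1 (assumed form of H_L)."""
    losses = np.asarray(losses, dtype=float)
    u = losses / (losses.sum() + eps)          # normalized loss values u_i
    return float(-(u * np.log(u + eps)).sum())

def pick_stop_iteration(entropies, patience=10):
    """Hypothetical patience rule: return the iteration with the lowest entropy
    seen before `patience` iterations pass without a new minimum."""
    best_iter, best_h = 0, float("inf")
    for i, h in enumerate(entropies):
        if h < best_h:
            best_iter, best_h = i, h           # new minimum: remember this iteration
        elif i - best_iter >= patience:
            break                              # no improvement for `patience` iterations
    return best_iter
```

In a training loop, one would compute `loss_entropy` on the per-sample losses at each evaluation step, append it to a list, and keep the model checkpoint at the iteration returned by `pick_stop_iteration`. No labels are involved at any point.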

Paper Structure (48 sections, 1 theorem, 37 equations, 25 figures, 11 tables, 1 algorithm)

Key Result

Theorem 4.1

When $\mathcal{L}_{in} < \mathcal{L}_{out}$ and the AUC increases, $H_L$ is more likely to decrease.
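A toy numerical check can illustrate the direction of this relationship. It is not the paper's proof. The sketch assumes $H_L$ is the Shannon entropy of the sum-normalized per-sample losses, and it uses made-up loss values: 10 outliers with fixed losses and 90 inliers whose losses are scaled down step by step, mimicking a model that fits inliers first. All names and numbers below are hypothetical.

```python
import numpy as np

def loss_entropy(losses, eps=1e-12):
    """Shannon entropy of sum-normalized per-sample losses (assumed form of H_L)."""
    losses = np.asarray(losses, dtype=float)
    u = losses / (losses.sum() + eps)
    return float(-(u * np.log(u + eps)).sum())

def pairwise_auc(inlier_losses, outlier_losses):
    """AUC when the loss is the outlier score: fraction of (outlier, inlier)
    pairs where the outlier has the larger loss, counting ties as one half."""
    diff = np.asarray(outlier_losses)[:, None] - np.asarray(inlier_losses)[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def toy_curve(scales=(0.9, 0.5, 0.2)):
    """Shrink inlier losses by each scale and record (scale, AUC, H_L)."""
    outliers = np.linspace(0.5, 1.5, 10)            # synthetic outlier losses
    rows = []
    for s in scales:
        inliers = s * np.linspace(0.5, 1.5, 90)     # synthetic inlier losses
        auc = pairwise_auc(inliers, outliers)
        h = loss_entropy(np.concatenate([inliers, outliers]))
        rows.append((s, auc, h))
    return rows

for s, auc, h in toy_curve():
    print(f"inlier scale={s:.1f}  AUC={auc:.3f}  H_L={h:.3f}")
```

In this synthetic setting the mean inlier loss stays below the mean outlier loss throughout. As the inlier losses shrink, the AUC rises while the entropy falls, because the normalized loss mass concentrates on the few outliers.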

Figures (25)

  • Figure 1: Two paradigms of unsupervised OD
  • Figure 2: UOD training process of AutoEncoder on 2 datasets
  • Figure 3: Loss Gap for inliers and outliers in AE models on MNIST and Letter datasets
  • Figure 4: An example of the training process. The AE model is trained on the Ionosphere dataset for 300 iterations. In this example, the lowest $H_L$ exactly matches the optimal AUC at the $49^{th}$ iteration. The y-axis of the two scatter plots (i.e., the $4^{th}$ and $5^{th}$ subfigures) is the normalized loss value $u_i$.
  • Figure 5: Examples of AUC and loss entropy curves during the training of AE and RDP on some datasets. "select_iter" denotes the iteration selected by EntropyStop.
  • ...and 20 more figures

Theorems & Definitions (4)

  • Theorem 4.1
  • proof
  • proof
  • proof