A Generic Machine Learning Framework for Fully-Unsupervised Anomaly Detection with Contaminated Data

Markus Ulmer, Jannik Zgraggen, Lilach Goren Huber

TL;DR

The paper addresses anomaly detection when training data are contaminated by anomalies, a common real-world challenge for residual-based methods. It introduces USDR, a fully unsupervised, model-agnostic data refinement framework that partitions the unlabeled data into overlapping subsets, trains an ensemble of residual-based predictors, and computes a per-sample refinement score from the ensemble's generalization contribution to identify and remove anomalies without labels. Across MIMII acoustic data and CMAPSS turbofan data, USDR markedly improves performance over naive blind training and often approaches or matches the anomaly-free (clean) training reference, demonstrating robustness to varying contamination levels and fault types. The approach is simple, data-centric, and broadly applicable to time-series data, with potential extension to other modalities and deeper exploration of hyperparameter settings.
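The refinement loop summarized above can be illustrated with a small sketch. This is not the authors' implementation: the exact USDR refinement score is not given in this summary, so the sketch substitutes a simple stand-in, namely each sample's reconstruction residual under a model that did not see it during training. The function names (`fit_pca`, `residual`, `refinement_scores`, `refine`), the block-wise construction of overlapping training subsets, and the `drop_fraction` parameter are all assumptions made for illustration. PCA is used as the residual-based model because it is one of the two models named in the figure captions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA reconstruction model via SVD; returns (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def residual(X, model):
    """Per-sample reconstruction error (L2 norm) under a PCA model."""
    mean, comps = model
    Xc = X - mean
    recon = Xc @ comps.T @ comps
    return np.linalg.norm(Xc - recon, axis=1)

def refinement_scores(X, n_blocks=5, n_components=2):
    """Assumed stand-in score: residual of each sample under an ensemble
    member trained without that sample's block.

    The time-ordered data is cut into contiguous blocks. Each ensemble
    member trains on all blocks but one, so the training subsets overlap,
    and scores only the block it did not see.
    """
    n = len(X)
    all_idx = np.arange(n)
    scores = np.empty(n)
    for held_out in np.array_split(all_idx, n_blocks):
        train_idx = np.setdiff1d(all_idx, held_out)
        model = fit_pca(X[train_idx], n_components)
        scores[held_out] = residual(X[held_out], model)
    return scores

def refine(X, drop_fraction=0.1, **kw):
    """Return a boolean keep-mask that drops the highest-scoring samples."""
    scores = refinement_scores(X, **kw)
    keep = scores <= np.quantile(scores, 1.0 - drop_fraction)
    return keep, scores
```

A final model would then be retrained on `X[keep]` only. Note that `drop_fraction` is a hypothetical knob introduced for this sketch; the summary does not state how the paper chooses how many samples to remove.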

Abstract

Anomaly detection (AD) tasks have been solved using machine learning algorithms in various domains and applications. The great majority of these algorithms use normal data to train a residual-based model and assign anomaly scores to unseen samples based on their dissimilarity with the learned normal regime. The underlying assumption of these approaches is that anomaly-free data is available for training. This is, however, often not the case in real-world operational settings, where the training data may be contaminated with an unknown fraction of abnormal samples. Training with contaminated data, in turn, inevitably leads to a deteriorated AD performance of the residual-based algorithms. In this paper, we introduce a framework for a fully unsupervised refinement of contaminated training data for AD tasks. The framework is generic and can be applied to any residual-based machine learning model. We demonstrate the application of the framework to two public datasets of multivariate time series machine data from different application fields. We show its clear superiority over the naive approach of training with contaminated data without refinement. Moreover, we compare it to the ideal, unrealistic reference in which anomaly-free data would be available for training. The method is based on evaluating the contribution of individual samples to the generalization ability of a given model, and contrasting the contribution of anomalies with that of normal samples. As a result, the proposed approach is comparable to, and often outperforms, training with normal samples only.

Paper Structure (22 sections, 8 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The proposed Unsupervised Data Refinement framework.
  • Figure 2: MIMII experiment design.
  • Figure 3: Examples of the framework performance for the MIMII data (test case I). The derived scores of the USDR framework are compared with the scores of blind training with the contaminated data and with those of clean training with normal data as a reference, calculated using PCA (upper row) and AE (lower row). The results are shown for selected cases: pump (id00, 0 dB), fan (id00, 6 dB), and valve (id02, 0 dB). The two columns on the right show the precision-recall curves (PRC) and the average precision (AP) for the three methods.
  • Figure 4: Examples of the framework performance for the MIMII data (test case II). The figure structure is similar to Fig. 3, demonstrated for the fan system. Here we assumed three short faulty periods instead of a single long fault.
  • Figure 5: Examples of the framework performance with the turbofan engine CMAPSS data. The derived scores of the USDR framework are compared with the scores of blind training with all the contaminated data, blind ensemble, and clean training with normal data (first 10 engine cycles) as a reference, calculated using PCA (upper row) and AE (lower row). The results are shown for engine 5. The column on the right shows the RMSE for the four methods.