Can I trust my anomaly detection system? A case study based on explainable AI

Muhammad Rashid, Elvio Amparore, Enrico Ferrari, Damiano Verda

TL;DR

This work questions the reliability of anomaly detectors that rely on reconstruction-based scores from VAE-GAN models. It applies model-agnostic explainable AI methods (LIME and SHAP) to audit not just whether an instance is labeled anomalous, but why. It formalizes an end-to-end workflow on the MVTec dataset, defines reconstruction-error maps and a calibrated anomaly threshold, and pairs these with an IoU-based comparison between explanations and ground-truth anomaly regions. The study shows that samples can be flagged as anomalous for incorrect reasons and that explanation quality varies with the XAI method and the segmentation strategy, which underscores the need for explainability-driven validation before deployment. Overall, the results advocate using XAI-driven localization (measured by maximal IoU) alongside traditional anomaly scores to make AD systems more trustworthy and reliable in real-world settings.
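The IoU-based comparison mentioned above can be illustrated with a small sketch. It is not taken from the paper: it assumes that an explanation assigns one importance score to each image segment, that segments are added in decreasing order of importance, and that "maximal IoU" is the best overlap with the ground-truth region reached along the way. The names `iou` and `maximal_iou` and the set-of-pixels representation are hypothetical choices made for brevity.

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0  # convention chosen here: two empty masks have IoU 0
    return len(mask_a & mask_b) / len(union)


def maximal_iou(segments, importances, ground_truth):
    """Best IoU reached by the top-k most important segments, for k = 1..n.

    segments:     dict segment_id -> set of (row, col) pixels
    importances:  dict segment_id -> attribution score (e.g. from LIME or SHAP)
    ground_truth: set of (row, col) pixels marked as anomalous
    """
    # Assumption: a higher score means the segment matters more to the detector.
    ranked = sorted(segments, key=lambda s: importances[s], reverse=True)
    selected = set()
    best = 0.0
    for seg_id in ranked:
        selected |= segments[seg_id]  # grow the explanation by one segment
        best = max(best, iou(selected, ground_truth))
    return best


# Toy example: three 2-pixel segments on a 3x2 image.
segments = {
    0: {(0, 0), (0, 1)},
    1: {(1, 0), (1, 1)},
    2: {(2, 0), (2, 1)},
}
importances = {0: 0.1, 1: 0.7, 2: 0.4}
ground_truth = {(1, 0), (1, 1), (2, 0)}

print(maximal_iou(segments, importances, ground_truth))  # 0.75
```

In this toy case the best overlap is reached with the two most important segments; adding the third one lowers the IoU, which is why the maximum over k is taken rather than the value at a fixed k.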

Abstract

Generative models based on variational autoencoders are a popular technique for detecting anomalies in images in a semi-supervised context. A common approach uses an anomaly score to detect the presence of anomalies, and it is known to reach a high level of accuracy on benchmark datasets. However, since anomaly scores are computed from reconstruction disparities, they can conceal the fact that detection relies on spurious features, raising concerns about their actual efficacy. This case study explores the robustness of an anomaly detection system based on variational autoencoder generative models through the use of eXplainable AI methods. The goal is to gain a different perspective on the real performance of anomaly detectors that use reconstruction differences. In our case study we discovered that, in many cases, samples are detected as anomalous for wrong or misleading reasons.
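To make the notion of a score "computed from reconstruction disparities" concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's definition: the per-pixel squared error, the mean aggregation, and the quantile-based threshold are all assumed, and the function names are hypothetical. Nested Python lists stand in for grayscale images.

```python
def error_map(image, reconstruction):
    """Per-pixel squared difference between an image and its reconstruction."""
    return [
        [(x - r) ** 2 for x, r in zip(img_row, rec_row)]
        for img_row, rec_row in zip(image, reconstruction)
    ]


def anomaly_score(err):
    """Collapse an error map into one number (assumed here: the mean)."""
    values = [v for row in err for v in row]
    return sum(values) / len(values)


def calibrate_threshold(normal_scores, quantile=0.95):
    """Pick a threshold from scores of anomaly-free samples (assumed rule)."""
    ordered = sorted(normal_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


# Toy 2x2 example: the model fails to reconstruct one bright pixel.
image = [[0.0, 1.0], [1.0, 0.0]]
reconstruction = [[0.0, 1.0], [0.0, 0.0]]

err = error_map(image, reconstruction)
score = anomaly_score(err)
threshold = calibrate_threshold([0.01, 0.02, 0.03, 0.05])

print(err)                # [[0.0, 0.0], [1.0, 0.0]]
print(score)              # 0.25
print(score > threshold)  # True
```

The single number `score` says nothing about which pixels produced it. Two images with the same score can have entirely different error maps, which is the gap that the explanation-based audit is meant to expose.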

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures.

Figures (4)

  • Figure 1: AD system using a VAE-GAN model with LIME explanations.
  • Figure 2: Maximum IoU vs the anomaly scores in the two test datasets.
  • Figure 3: Explanations for a few anomalous samples of the hazelnut dataset.
  • Figure 4: Explanations for a few anomalous samples of the screw dataset.