AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance

Joao P. C. Bertoldo; Dick Ameln; Ashwin Vaidya; Samet Akçay

TL;DR

This paper introduces Per-IMage Overlap (PIMO), a novel metric that addresses the shortcomings of AUROC and AUPRO. It offers practical advantages and nuanced performance insights that redefine anomaly detection benchmarks, notably challenging the perception that the MVTec AD and VisA datasets have been solved by contemporary models.

Abstract

Recent advances in visual anomaly detection research have seen AUROC and AUPRO scores on public benchmark datasets such as MVTec and VisA converge towards perfect recall, giving the impression that these benchmarks are near-solved. However, high AUROC and AUPRO scores do not always reflect qualitative performance, which limits the validity of these metrics in real-world applications. We argue that the artificial ceiling imposed by the lack of an adequate evaluation metric restrains the progress of the field, and that it is crucial to revisit the evaluation metrics used to rate our algorithms. In response, we introduce Per-IMage Overlap (PIMO), a novel metric that addresses the shortcomings of AUROC and AUPRO. PIMO retains the recall-based nature of the existing metrics but introduces two distinctions: curves (and their respective areas under the curve) are assigned per image, and its x-axis relies solely on normal images. Measuring recall per image simplifies instance score indexing and is more robust to noisy annotations. As we show, it also accelerates computation and enables the use of statistical tests to compare models. By imposing low tolerance for false positives on normal images, PIMO provides an enhanced model validation procedure and highlights performance variations across datasets. Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights that redefine anomaly detection benchmarks, notably challenging the perception that the MVTec AD and VisA datasets have been solved by contemporary models. Available on GitHub: https://github.com/jpcbertoldo/aupimo.
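The two distinctions described in the abstract (one curve per anomalous image, and an x-axis built from normal images only) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation from the linked repository. The function name `aupimo_sketch`, its parameters, and the quantile-based threshold selection are assumptions made for illustration; the integration range $[10^{-5}, 10^{-4}]$ and the log-scale x-axis follow the Figure 3 caption. The sketch also assumes that all images share the same size and that every anomalous image has at least one annotated pixel.

```python
import numpy as np


def aupimo_sketch(normal_maps, anomalous_maps, anomalous_masks,
                  fpr_bounds=(1e-5, 1e-4), num_points=300):
    """Return one AUPIMO-like score per anomalous image (illustrative only).

    normal_maps:     (Nn, H, W) anomaly score maps of normal images
    anomalous_maps:  (Na, H, W) anomaly score maps of anomalous images
    anomalous_masks: (Na, H, W) boolean ground-truth masks (True = anomalous)
    """
    lo, hi = fpr_bounds
    # Shared x-axis: FPR levels, log-spaced inside the integration range.
    fprs = np.logspace(np.log10(lo), np.log10(hi), num_points)

    # Thresholds come from normal images only. With equally sized images,
    # the FPR averaged across normal images equals the fraction of all
    # normal pixels above the threshold, so a quantile approximates it.
    normal_scores = np.asarray(normal_maps, dtype=float).ravel()
    thresholds = np.quantile(normal_scores, 1.0 - fprs)

    log_fprs = np.log10(fprs)
    scores = []
    for amap, mask in zip(anomalous_maps, anomalous_masks):
        mask = np.asarray(mask, dtype=bool)
        in_mask = np.sort(np.asarray(amap, dtype=float)[mask])
        # Image-scoped TPR: fraction of this image's anomalous pixels
        # scoring at or above each threshold.
        n_detected = in_mask.size - np.searchsorted(in_mask, thresholds, side="left")
        tpr = n_detected / in_mask.size
        # Trapezoidal area over log10(FPR), normalized to [0, 1].
        area = np.sum((tpr[1:] + tpr[:-1]) / 2.0 * np.diff(log_fprs))
        scores.append(area / (log_fprs[-1] - log_fprs[0]))
    return np.array(scores)
```

Because the function returns one score per anomalous image rather than a single pooled number, the scores of two models on the same images form paired samples, which is what makes a paired statistical comparison possible.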

Paper Structure (50 sections, 4 equations, 50 figures, 5 tables)

Introduction
Related Work
Metrics
Precursors: AUROC and AUPRO
Our Approach: AUPIMO
AUPIMO's properties
Bias-free validation
Anomaly-dependent metrics
Low tolerance
AUPRO vs. AUPIMO
Image-scoped metrics
Image-specific scores
Experimental Setup
Results
Benchmark on MVTec AD / Zipper
...and 35 more sections

Figures (50)

Figure 1: Left: performance on MVTec AD over time, approaching a near-100% performance plateau. Right: images from the Pill dataset (left column) and their inferred anomaly maps (right column; higher values mean more anomalous; JET colormap) from the best-performing model on this dataset (EfficientAD; see the benchmark appendix), with 98.7% AUROC and 96.7% AUPRO. The normal image (top) has higher anomaly scores than the anomalous image (bottom).
Figure 2: AUPRO and AUPIMO's upper bounds visualized as level sets from the anomaly score maps. Solid contours are level sets at thresholds yielding the maximum FPR in AUPRO (white) and AUPIMO (black). Images from the dataset MVTec AD / Capsule.
Figure 3: (Curve panels) ROC, PRO, and PIMO curves. The y-axes are TPR metrics: ROC uses the set TPR (all anomalous pixels from all images pooled); PRO uses the region-scoped TPR averaged across all regions from all images; PIMO uses the image-scoped TPR, keeping one curve per anomalous image (no cross-instance averaging). The x-axes are FPR metrics shared by all instances (i.e., anomalous regions for PRO and anomalous images for PIMO), which index the binarization thresholds. ROC and PRO use the set FPR (all normal pixels from all images pooled) in linear scale. PIMO uses the image-scoped FPR averaged across normal images only, in log scale. The curves are summarized by their (normalized) area under the curve (AUC), with different integration ranges: AUROC in $[0,1]$, AUPRO in $[0, 0.3]$, and AUPIMO in $[10^{-5}, 10^{-4}]$. (Benchmark panel) The benchmark on the dataset MVTec AD / Zipper shows how their AUCs differ.
Figure 4: Dataset-wise comparison. Each triangle is a set-scoped score (AUROC, AUPRO, and AUPRO$_{5\%}$) or a cross-image statistic (average AUPIMO) from a dataset in MVTec AD ($\vartriangle$) or VisA ($\triangledown$). Diamonds are cross-dataset averages (all datasets pooled). Plots have different x-axis scales. AUPIMO reveals that all models have a large cross-problem variance, meaning that none of the models is robust to all problems.
Figure 5: (a) Execution time of metrics on the MVTec AD / Screw dataset (image resolution of $1024 \times 1024$; average times over 3 runs). (b, top) An anomalous sample from the dataset VisA / Chewing Gum superimposed with its annotation (pink) shows meaningless, tiny (even 1-pixel) regions (the mask has not been downsampled). (b, bottom) Robustness to noisy annotations. Histograms show the distribution of the difference between the scores without and with the synthetic mistakes (closer to zero is better).
...and 45 more figures
