AnomalyDINO: Boosting Patch-based Few-shot Anomaly Detection with DINOv2

Simon Damm, Mike Laszkiewicz, Johannes Lederer, Asja Fischer

TL;DR

AnomalyDINO introduces a vision-only, training-free patch-based anomaly detector that uses high-quality DINOv2 features and a memory bank of nominal patches to detect industrial defects. By applying zero-shot masking and simple rotations, the method constructs robust patch representations and scores anomalies via a tail-based aggregation of patch distances, enabling both image-level detection and pixel-level localization. Across MVTec-AD and VisA, AnomalyDINO achieves state-of-the-art or competitive results in one-/few-shot settings with markedly faster inference than many multimodal baselines, making it well-suited for fast industrial deployment. The work highlights the strength of visual features over language-augmented models for certain few-shot anomaly tasks and outlines actionable follow-ups to further boost performance and batched-zero-shot capabilities.
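The scoring step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance, the `tail_fraction` default of 1%, and all function names are assumptions made for illustration, and short lists of floats stand in for DINOv2 patch features (assumed to be nonzero vectors).

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two nonzero feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def patch_distances(test_patches, memory_bank):
    """Distance of each test patch to its nearest nominal patch in the memory bank."""
    return [min(cosine_distance(p, m) for m in memory_bank) for p in test_patches]

def tail_score(distances, tail_fraction=0.01):
    """Image-level score: mean of the largest patch distances.

    tail_fraction is a hypothetical parameter; at least one patch is always used.
    """
    k = max(1, math.ceil(tail_fraction * len(distances)))
    top = sorted(distances, reverse=True)[:k]
    return sum(top) / k

# Toy data: two nominal patch features and two test images with two patches each.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]
nominal_test = [[2.0, 0.0], [0.0, 3.0]]      # same directions as the nominal patches
anomalous_test = [[2.0, 0.0], [-1.0, -1.0]]  # second patch points away from both

nominal_score = tail_score(patch_distances(nominal_test, memory_bank))
anomalous_score = tail_score(patch_distances(anomalous_test, memory_bank))
```

In this toy setup the nominal image scores zero, while a single deviating patch is enough to raise the score of the anomalous image. That is the point of aggregating over the tail rather than the mean of all patch distances: small defects are not averaged away by the many nominal patches around them.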

Abstract

Recent advances in multimodal foundation models have set new standards in few-shot anomaly detection. This paper explores whether high-quality visual features alone are sufficient to rival existing state-of-the-art vision-language models. We affirm this by adapting DINOv2 for one-shot and few-shot anomaly detection, with a focus on industrial applications. We show that this approach not only rivals existing techniques but can even outperform them in many settings. Our proposed vision-only approach, AnomalyDINO, follows the well-established patch-level deep nearest neighbor paradigm and enables both image-level anomaly prediction and pixel-level anomaly segmentation. The approach is methodologically simple and training-free and thus does not require any additional data for fine-tuning or meta-learning. Despite its simplicity, AnomalyDINO achieves state-of-the-art results in one- and few-shot anomaly detection (e.g., pushing the one-shot performance on MVTec-AD from an AUROC of 93.1% to 96.6%). The reduced overhead, coupled with its outstanding few-shot performance, makes AnomalyDINO a strong candidate for fast deployment, e.g., in industrial contexts.

Paper Structure (41 sections, 7 equations, 18 figures, 10 tables)

Figures (18)

  • Figure 1: Anomaly detection with AnomalyDINO based on a single immaculate reference sample (here category 'Screw' from MVTec-AD). We collect the nominal patch representations from the (potentially augmented) reference sample(s) in the memory bank $\mathcal{M}$. At test time, we select the relevant patch representations via masking (if applicable). The distances of those to the nominal representations in $\mathcal{M}$ give rise to an anomaly map and the corresponding anomaly score $s(\mathbf{x}_\mathrm{test})$ using the aggregation statistic $q$. For both masking and feature extraction, we utilize DINOv2 ($f$). Further examples for other categories are depicted on the right (and in the MVTec-AD and VisA example figures in the appendix with detailed results).
  • Figure 2: Masking test on MVTec-AD. For 'Capsule' and 'Hazelnut' the masking works successfully (top row), while for 'Cable' and 'Transistor' (bottom row) some areas are incorrectly predicted as background that should belong to the object of interest (highlighted in red). See the appendix on preprocessing ablations for the outcomes per object.
  • Figure 3: Detection AUROC vs. inference time per sample on MVTec-AD in the 1-shot setting. The input resolution is given in parentheses after the method name. All runtimes are measured on a single NVIDIA A40 if not stated otherwise. (Note that for ADP and WinCLIP+ no official code is available.)
  • Figure 4: Examples -- MVTec-AD (1/2). Depicted are, from left to right, a test sample per category (Query), the ground truth anomaly annotation (GT), and the predicted anomaly map from AnomalyDINO-S (448) in the 1- and 8-shot settings. The color coding is normalized by the max. score over 'good' test samples.
  • Figure 5: Examples -- MVTec-AD (2/2). See Figure 4 for a description.
  • ...and 13 more figures
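
The anomaly maps described in the captions of Figures 1 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the `grid_width` and `good_max` parameters, and the choice to assign a distance of zero to masked-out background patches are all hypothetical. The per-patch distances are assumed to be given in row-major order.

```python
def anomaly_map(distances, mask, grid_width):
    """Arrange per-patch distances into a 2D grid (row-major).

    Patches where mask is False (assumed background) are set to 0.0,
    which is an assumption of this sketch.
    """
    flat = [d if keep else 0.0 for d, keep in zip(distances, mask)]
    return [flat[i:i + grid_width] for i in range(0, len(flat), grid_width)]

def normalize_for_display(amap, good_max):
    """Divide every entry by a reference value taken from 'good' test samples.

    good_max is a hypothetical input: the maximum score observed on nominal
    test images. Values above 1.0 then exceed anything seen on those images.
    """
    return [[value / good_max for value in row] for row in amap]

# Toy example: a 2x2 patch grid in which the third patch is background.
distances = [0.25, 0.5, 0.125, 1.0]
mask = [True, True, False, True]

raw_map = anomaly_map(distances, mask, grid_width=2)
display_map = normalize_for_display(raw_map, good_max=0.5)
```

Here the bottom-right patch ends up with a normalized value of 2.0, twice the reference value, while the masked-out patch contributes nothing. In practice the coarse patch grid would additionally be upsampled to the input resolution for pixel-level localization, a step this sketch omits.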