FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model

Yuanwei Li, Elizaveta Ivanova, Martins Bruveris

TL;DR

FADE targets zero-/few-shot industrial anomaly detection by adapting the CLIP vision-language model without fine-tuning. It combines Grounding Everything Module (GEM) patch embeddings, multi-scale analysis, and an ensemble of text prompts generated automatically by a large language model. Four detection pipelines integrate language-guided and vision-guided cues for anomaly classification and segmentation, with memory banks supporting the few-shot setting. On MVTec-AD and VisA, FADE is competitive with or better than state-of-the-art methods, with notable gains in zero-shot anomaly segmentation. The approach offers a scalable, training-light path for industrial quality inspection; reproducibility and the choice of embeddings across tasks remain areas for further study.
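To make the memory-bank idea concrete, here is a minimal sketch of few-shot, vision-guided scoring. It is an illustration under stated assumptions, not the authors' implementation: random vectors stand in for CLIP/GEM patch embeddings, the function names are hypothetical, and the scoring rule (one minus the highest cosine similarity to any stored normal patch) is a common nearest-neighbour choice that this summary does not confirm as FADE's exact criterion.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def memory_bank_scores(query_patches, memory_bank):
    """Score each query patch by its distance to the nearest normal patch.

    query_patches: (N, D) patch embeddings of the image under test.
    memory_bank:   (M, D) patch embeddings collected from normal reference images.
    Returns an (N,) array; larger values mean "less like anything normal".
    """
    q = l2_normalize(query_patches)
    m = l2_normalize(memory_bank)
    similarity = q @ m.T                      # (N, M) cosine similarities
    return 1.0 - similarity.max(axis=1)       # assumed scoring rule, see lead-in


# Toy data (assumption): random vectors replace real patch embeddings.
rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(64, 16))

# Eight query patches copied from the bank, then one is corrupted.
query = memory_bank[:8].copy()
query[3] = -memory_bank[3]                    # patch 3 no longer matches its source

scores = memory_bank_scores(query, memory_bank)
image_score = scores.max()                    # image-level score from the worst patch
most_anomalous_patch = int(scores.argmax())
```

In this toy setup the seven untouched patches score close to zero because each has an identical entry in the bank, so the corrupted patch stands out. Taking the maximum patch score as the image-level score is likewise an assumption made for the sketch.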

Abstract

Automatic image anomaly detection is important for quality inspection in the manufacturing industry. The usual unsupervised anomaly detection approach is to train a model for each object class using a dataset of normal samples. However, a more realistic problem is zero-/few-shot anomaly detection, where zero or only a few normal samples are available. This makes the training of object-specific models challenging. Recently, large foundation vision-language models have shown strong zero-shot performance in various downstream tasks. While these models have learned complex relationships between vision and language, they are not specifically designed for the task of anomaly detection. In this paper, we propose the Few-shot/zero-shot Anomaly Detection Engine (FADE), which leverages the vision-language CLIP model and adjusts it for the purpose of industrial anomaly detection. Specifically, we improve language-guided anomaly segmentation 1) by adapting CLIP to extract multi-scale image patch embeddings that are better aligned with language and 2) by automatically generating an ensemble of text prompts related to industrial anomaly detection. In addition, 3) we use vision-based guidance from the query and reference images to further improve both zero-shot and few-shot anomaly detection. On the MVTec-AD (and VisA) dataset, FADE outperforms other state-of-the-art methods in anomaly segmentation with pixel-AUROC of 89.6% (91.5%) in zero-shot and 95.4% (97.5%) in 1-normal-shot. Code is available at https://github.com/BMVC-FADE/BMVC-FADE.
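The language-guided segmentation step described in the abstract can be sketched roughly as follows. This is a hedged illustration rather than the released code: synthetic vectors stand in for CLIP text and patch embeddings, the helper names are hypothetical, and both the averaging of prompt embeddings and the temperature value of 0.07 are assumptions. Multi-scale patch extraction is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def ensemble_text_embedding(prompt_embeddings):
    """Collapse a (P, D) prompt ensemble into one unit vector by averaging."""
    return l2_normalize(l2_normalize(prompt_embeddings).mean(axis=0))


def language_guided_map(patches, normal_prompts, abnormal_prompts,
                        grid_hw, temperature=0.07):
    """Return a per-patch probability of "abnormal" on a (H, W) grid.

    patches: (H*W, D) patch embeddings assumed to live in the text space.
    normal_prompts / abnormal_prompts: (P, D) text-prompt embeddings.
    """
    text = np.stack([ensemble_text_embedding(normal_prompts),
                     ensemble_text_embedding(abnormal_prompts)])   # (2, D)
    logits = l2_normalize(patches) @ text.T / temperature          # (H*W, 2)
    logits = logits - logits.max(axis=1, keepdims=True)            # stable softmax
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1].reshape(grid_hw)


# Toy data (assumption): "normal" and "abnormal" are two orthogonal directions.
rng = np.random.default_rng(0)
dim, grid_hw = 8, (4, 4)
normal_dir, abnormal_dir = np.eye(dim)[0], np.eye(dim)[1]

normal_prompts = normal_dir + rng.normal(scale=0.05, size=(5, dim))
abnormal_prompts = abnormal_dir + rng.normal(scale=0.05, size=(5, dim))

patches = normal_dir + rng.normal(scale=0.05, size=(16, dim))
patches[5] = abnormal_dir + rng.normal(scale=0.05, size=dim)       # one defective patch

anomaly_map = language_guided_map(patches, normal_prompts, abnormal_prompts, grid_hw)
peak = np.unravel_index(anomaly_map.argmax(), grid_hw)             # location of the defect
```

Averaging several prompt embeddings before comparison reduces sensitivity to the wording of any single prompt, which is the usual motivation for a prompt ensemble. In practice the coarse patch-level map would still need to be upsampled to image resolution before computing pixel-level metrics such as pixel-AUROC.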

Paper Structure (10 sections, 2 equations, 5 figures, 9 tables)

This paper contains 10 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 2: Different components of FADE. (a) Zero-shot language-guided AC; (b) Zero-shot language-guided AS; (c) Zero-shot vision-guided AS; (d) Few-shot vision-guided AC and AS; (e) ChatGPT prompts generation: An instruction given to ChatGPT and some of its responses.
  • Remaining figures: numbers and captions were not recovered; the surviving panel labels are (a) MVTec-AD and (b) VisA.