Search is All You Need for Few-shot Anomaly Detection

Qishan Wang, Jia Guo, Shuyong Gao, Haofen Wang, Li Xiong, Junjie Hu, Hanqi Guo, Wenqiang Zhang

TL;DR

This work tackles FSAD in industrial inspection by removing reliance on heavy prompt engineering and dataset-specific training. It introduces VisionAD, a training-free, vision-only approach that uses scalable vision foundation models, dual augmentations, multi-layer feature fusion, and a category-aware memory bank to enable one-for-all multi-class anomaly detection. Across MVTec-AD, VisA, and Real-IAD, VisionAD achieves state-of-the-art 1-shot performance and strong few-shot results, demonstrating robust anomaly localization and competitive full-shot baselines. The approach offers practical value for real-world scenarios with scarce or expensive normal samples, and its training-free nature facilitates rapid deployment with minimal tuning.

Abstract

Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies: support augmentation to enhance feature matching adaptability, and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across the MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal image as support, our method achieves image-level AUROC scores of 97.4%, 94.8%, and 70.8%, respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at https://github.com/Qiqigeww/VisionAD.
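
The nearest-neighbor search idea at the core of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance, the max-over-patches image score, and the function names are all assumptions made for illustration, and the short lists of floats stand in for patch features that a vision foundation model would produce.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def patch_anomaly_scores(query_patches, memory_bank):
    """Score each query patch by its distance to the nearest normal patch."""
    return [
        min(cosine_distance(q, m) for m in memory_bank)
        for q in query_patches
    ]


def image_anomaly_score(query_patches, memory_bank):
    """Assumed image-level rule: take the largest patch score."""
    return max(patch_anomaly_scores(query_patches, memory_bank))


# Toy memory bank built from the patches of a single "normal" support image.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]

# Every patch of this query points in a direction already seen in the bank.
normal_query = [[2.0, 0.0], [0.0, 3.0]]

# The second patch of this query matches no stored direction well.
anomalous_query = [[1.0, 0.0], [1.0, 1.0]]

normal_score = image_anomaly_score(normal_query, memory_bank)
anomalous_score = image_anomaly_score(anomalous_query, memory_bank)
```

In this toy setup the anomalous query receives a strictly higher score than the normal one. No parameters are fitted at any point, which is the sense in which a search-based method is training-free: adding a support image only appends its patch features to the bank.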

Paper Structure

This paper contains 20 sections, 6 equations, 10 figures, 18 tables.

Figures (10)

  • Figure 1: Comparisons between VisionAD and existing FSAD methods. (a) Existing FSAD models rely on complex manual or learnable text prompts and simple adapters or V-V attention for local features, resulting in a cumbersome "one-category-one-model" paradigm. (b) Our VisionAD, a plain, training-free vision-guided approach, generalizes effectively across multiple classes ("one-for-all" paradigm). (c) Comparison with previous SoTA methods on MVTec-AD [bergmann2019mvtec] and VisA [zou2022spot] across various settings, such as 1-shot, 2-shot, and 4-shot support images.
  • Figure 2: The overall framework of VisionAD. First, the selected k-shot normal images in the support set are augmented, and an identical view transformation is applied to both the query and all support images (including augmented images). Next, patch features and global features of the reference normal images are extracted using a vision foundation model and stored in the patch memory bank and the global memory bank, respectively. The features and their corresponding categories are combined into key–value pairs. During testing, the category of the query image is determined first, the patch memory bank designated for that category is searched, and the scores from the base view and the transformed views are fused to produce the anomaly detection and localization results.
  • Figure 3: Support Enhancement and Pseudo Multi-View for Anomaly Detection.
  • Figure 4: Schemes of Feature Fusion. (a) Layer-to-layer (sparse). (b) Layer-patchify-cat-to-layer. (c) Group-to-group, 1-group (Ours). (d) Group-to-group, 2-group.
  • Figure 5: Qualitative comparison results of 1-shot pixel-level anomaly detection on MVTec [bergmann2019mvtec] and VisA [zou2022spot].
  • ...and 5 more figures
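
The test-time flow described in the Figure 2 caption (determine the category, search that category's patch memory bank, fuse scores across views) can be sketched roughly as follows. This is a hypothetical illustration rather than the paper's code: the class `CategoryMemoryBank`, the helper `fuse_views`, the Euclidean distance, the nearest-global-feature category rule, and the mean fusion are all assumptions, and the per-view patch scores are assumed to be already aligned to the base view.

```python
import math
from collections import defaultdict


class CategoryMemoryBank:
    """Hypothetical key-value memory: global features map to categories,
    and each category owns its own list of normal patch features."""

    def __init__(self):
        self.global_bank = []                # (global_feature, category) pairs
        self.patch_bank = defaultdict(list)  # category -> patch features

    def add_support(self, category, global_feature, patch_features):
        """Store one normal support image (or one augmented copy of it)."""
        self.global_bank.append((global_feature, category))
        self.patch_bank[category].extend(patch_features)

    def predict_category(self, global_feature):
        """Assumed rule: category of the nearest stored global feature."""
        _, category = min(
            self.global_bank,
            key=lambda pair: math.dist(pair[0], global_feature),
        )
        return category

    def patch_scores(self, category, patch_features):
        """Nearest-neighbor distance of each patch within one category."""
        bank = self.patch_bank[category]
        return [min(math.dist(p, m) for m in bank) for p in patch_features]


def fuse_views(score_maps):
    """Assumed fusion rule: average aligned per-patch scores across views."""
    count = len(score_maps)
    return [sum(scores) / count for scores in zip(*score_maps)]


bank = CategoryMemoryBank()
bank.add_support("screw", [0.0, 0.0], [[0.0, 0.0], [1.0, 0.0]])
bank.add_support("cable", [10.0, 10.0], [[5.0, 5.0]])

# Query image: one global feature plus patch features for two views.
category = bank.predict_category([1.0, 0.5])
base_view = bank.patch_scores(category, [[0.0, 0.0], [1.0, 3.0]])
transformed_view = bank.patch_scores(category, [[1.0, 0.0], [1.0, 1.0]])
fused = fuse_views([base_view, transformed_view])
```

Restricting the patch search to the predicted category keeps the per-query cost tied to one category's support set rather than to every category in the bank, which is one plausible reading of how a single memory can serve the "one-for-all" multi-class setting.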