AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples

Yujin Lee, Seoyoon Jang, Hyunsoo Yoon

TL;DR

AnoPLe addresses few-shot anomaly detection without access to true anomalies by introducing bidirectional, learnable prompts that couple the textual and visual modalities within CLIP, together with a lightweight multi-view decoder and a memory-guided localization mechanism. It simulates anomalies in both pixel and latent spaces and trains with losses that align local (pixel-level) and global (image-level) semantics. On the MVTec-AD and VisA benchmarks it achieves strong image- and pixel-level AUROC (e.g., image-level AUROC of 94.1% on MVTec-AD and 86.2% on VisA in the 1-shot setting) without using true anomaly data. Across ablations and prompt-guided evaluations, AnoPLe consistently outperforms non-anomaly-aware baselines and remains competitive with state-of-the-art methods that use true anomalies, with robust performance in the 1-, 2-, and 4-shot regimes. Because it requires only normal samples, the method reduces data requirements and supports scalable deployment in industrial inspection scenarios.

Abstract

Few-shot Anomaly Detection (FAD) poses significant challenges due to the limited availability of training samples and the frequent absence of abnormal samples. Previous approaches often rely on annotations or true abnormal samples to improve detection, but such textual or visual cues are not always accessible. To address this, we introduce AnoPLe, a multi-modal prompt learning method designed for anomaly detection without prior knowledge of anomalies. AnoPLe simulates anomalies and employs bidirectional coupling of textual and visual prompts to facilitate deep interaction between the two modalities. Additionally, we integrate a lightweight decoder with a learnable multi-view signal, trained on multi-scale images to enhance local semantic comprehension. To further improve performance, we align global and local semantics, enriching the image-level understanding of anomalies. The experimental results demonstrate that AnoPLe achieves strong FAD performance, recording 94.1% and 86.2% Image AUROC on MVTec-AD and VisA respectively, with only around a 1% gap compared to the SoTA, despite not being exposed to true anomalies. Code is available at https://github.com/YoojLee/AnoPLe.
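
The prompt-guided scoring idea behind CLIP-based anomaly detectors can be illustrated with a minimal sketch. This is not the AnoPLe implementation: it assumes that an image embedding and two text embeddings (one for a "normal" prompt, one for an "abnormal" prompt) have already been computed, and the function name `prompt_anomaly_score` and the temperature value of 0.07 are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prompt_anomaly_score(image_emb, normal_text_emb, abnormal_text_emb,
                         temperature=0.07):
    """Return the softmax probability assigned to the 'abnormal' prompt.

    All three inputs are 1-D embeddings of the same dimension. The
    temperature is an assumed value, not one taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(np.stack([normal_text_emb, abnormal_text_emb]))
    # Cosine similarity to each prompt, sharpened by the temperature.
    logits = (txt @ img) / temperature
    # Subtract the maximum for a numerically stable softmax.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1])


if __name__ == "__main__":
    normal_prompt = np.array([1.0, 0.0])
    abnormal_prompt = np.array([0.0, 1.0])
    print(prompt_anomaly_score(np.array([0.9, 0.1]), normal_prompt, abnormal_prompt))
    print(prompt_anomaly_score(np.array([0.1, 0.9]), normal_prompt, abnormal_prompt))
```

In a learned-prompt method the two text embeddings would come from trainable prompt tokens rather than fixed vectors; the toy 2-D vectors here only show how the score moves toward 1 as the image embedding moves toward the abnormal prompt.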

Paper Structure

This paper contains 48 sections, 15 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: (Left) Different prompt designs used in prompt-guided anomaly detection. (Right) Comparative results in prompt-guided anomaly detection on MVTec-AD [bergmann2019mvtec] and VisA [zou2022spot]. PromptAD-g is PromptAD [li2024promptadcvpr] without object-customized prompts, while MaPLe+DRAEM combines MaPLe [khattak2023maple] with pseudo anomalies [zavrtanik2021draem]. AnoPLe* and AnoPLe refer to our method in the one- and multi-class settings, respectively.
  • Figure 2: Overview of AnoPLe. AnoPLe leverages learnable multi-modal deep prompts with bidirectional coupling between textual and visual prompts, utilizing pseudo anomalies. For localization, we introduce a multi-view aware decoder, enabling the model to effectively learn both local and global anomalies. During training, prompts are updated through local/global level losses and an alignment loss. During inference, the anomaly score is derived from prediction logits and a visual memory bank.
  • Figure 3: Qualitative Results on MVTec and VisA for 1-shot Pixel-level Anomaly Detection. We present heat maps and images highlighted with anomaly regions for each method.
  • Figure 4: Image-level Feature Visualization using PCA in a 1-shot setting. Features from the "hazelnut" class of MVTec and the "pcb4" class of VisA are used.
  • Figure 5: Comparison Across Different Textual and Visual Context Lengths for MVTec and VisA.
  • ...and 7 more figures
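
The Figure 2 caption states that, at inference, the anomaly score is derived from prediction logits and a visual memory bank. The sketch below shows one plausible way such a fusion could work; it is an assumption-laden illustration, not the paper's formula. The helper names `memory_bank_distance` and `fused_image_score`, the use of cosine distance, the max-over-patches pooling, and the equal weighting of 0.5 are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def memory_bank_distance(patch_feats, memory_bank):
    """Cosine distance from each test patch to its nearest stored normal patch.

    patch_feats: (N, D) features of the test image.
    memory_bank: (M, D) features collected from the few normal shots.
    Returns an array of shape (N,) with values in [0, 2].
    """
    p = l2_normalize(patch_feats)
    m = l2_normalize(memory_bank)
    similarity = p @ m.T                 # (N, M) cosine similarities
    return 1.0 - similarity.max(axis=1)  # nearest neighbour per patch


def fused_image_score(abnormal_prob, patch_distances, weight=0.5):
    """Blend a logit-based probability with a memory-based score.

    The linear blend and the weight are assumptions for illustration.
    """
    memory_score = float(np.clip(np.max(patch_distances), 0.0, 1.0))
    return weight * float(abnormal_prob) + (1.0 - weight) * memory_score


if __name__ == "__main__":
    bank = np.array([[1.0, 0.0], [0.0, 1.0]])
    test_patches = np.array([[1.0, 0.0], [-1.0, 0.0]])
    distances = memory_bank_distance(test_patches, bank)
    print(distances, fused_image_score(0.8, distances))
```

The per-patch distances can also be reshaped into a coarse anomaly map for localization, while the pooled value feeds the image-level score.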