ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li

TL;DR

ViP$^{2}$-CLIP addresses zero-shot anomaly detection by replacing class-name templates with image-conditioned prompts via Visual-Perception Prompting (ViP-Prompt), which fuses global and multi-scale local cues. It introduces an Image-Conditioned Adapter (ICA) and a Fine-Grained Perception Module (FGP) to produce dynamic, context-aware prompts, and pairs them with Unified Text-Patch Alignment (UTPA) to jointly optimize image-level detection and pixel-level localization. The method achieves state-of-the-art results across 15 industrial and medical benchmarks and demonstrates robust cross-domain generalization, supported by comprehensive ablations and efficiency analyses. This approach reduces reliance on manual templates and class-name priors, enabling privacy-friendly, scalable ZSAD with strong localization accuracy. The work highlights a practical, interpretable path to robust anomaly detection in data-scarce settings and complex semantic variations.

Abstract

Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target-domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and offer limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.
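
To make the idea of image-conditioned prompts concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a set of learnable prompt token embeddings, a global image feature, and patch features from a vision encoder. The function name `condition_prompts`, the projection matrices `w_global` and `w_local`, the additive global bias, and the single-head attention over patches are all illustrative assumptions; the paper's ICA and FGP modules may be structured differently.

```python
import numpy as np

def condition_prompts(prompt_tokens, global_feat, patch_feats, w_global, w_local):
    """Hypothetical sketch of image-conditioned prompt tokens.

    prompt_tokens: (n_tokens, d_text) learnable prompt embeddings
    global_feat:   (d_img,)           image-level feature
    patch_feats:   (n_patches, d_img) patch-level features
    w_global, w_local: (d_img, d_text) assumed projection matrices
    """
    # Global step (assumption): project the image-level feature into the
    # prompt embedding space and add it to every prompt token.
    tokens = prompt_tokens + global_feat @ w_global

    # Local step (assumption): each prompt token attends over the projected
    # patch features and absorbs a weighted summary of them.
    local = patch_feats @ w_local                          # (n_patches, d_text)
    scores = tokens @ local.T / np.sqrt(tokens.shape[1])   # (n_tokens, n_patches)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)          # rows sum to 1
    return tokens + attn @ local, attn

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
prompts, attn = condition_prompts(
    rng.normal(size=(4, 8)),    # 4 prompt tokens, text dim 8
    rng.normal(size=6),         # global feature, image dim 6
    rng.normal(size=(9, 6)),    # 3x3 grid of patch features
    rng.normal(size=(6, 8)),
    rng.normal(size=(6, 8)),
)
print(prompts.shape, attn.shape)
```

The point of the sketch is only that the resulting prompt embeddings change with the input image, in contrast to a static learnable prompt that is identical for every image.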

Paper Structure

This paper contains 46 sections, 9 equations, 22 figures, and 21 tables.

Figures (22)

  • Figure 1: Comparison between prior CLIP-based methods and ViP$^{2}$-CLIP. ViP$^{2}$-CLIP introduces ViP-Prompt to replace fixed class-name tokens with image-conditioned prompts that fuse global and local cues, and it first employs a unified patch-level alignment within training-based CLIP models.
  • Figure 2: Framework of ViP$^{2}$-CLIP. ViP$^{2}$-CLIP first introduces ViP-Prompt to enhance cross-modal alignment: ViP-ICA injects global visual context into the prompts' embedding space, while ViP-FGP fuses local patch features to enhance the prompts' fine-grained perceptual capacity. Finally, the UTPA module performs unified alignment in multiple layers to jointly support image-level anomaly detection and pixel-level anomaly localization.
  • Figure 3: Visualization results of the attention maps from different prompts in our FGP module.
  • Figure 4: Visualization of anomaly maps of different ZSAD methods. Our proposed ViP$^{2}$-CLIP achieves the sharpest segmentations, capturing fine-grained defects in both industrial and medical datasets.
  • Figure 5: F1 gains of using visual-conditioned prompts compared to static learnable prompts.
  • ...and 17 more figures
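
The Figure 2 caption describes an alignment between text prompts and image patches that serves both image-level detection and pixel-level localization. The sketch below illustrates one common way such an alignment can be turned into scores; it is an assumption-laden illustration rather than the paper's UTPA module. It uses a single feature layer, one "normal" and one "abnormal" text embedding, a softmax over cosine similarities with an assumed temperature of 0.07, and the maximum of the anomaly map as the image-level score. The function names and all of these choices are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def anomaly_scores(patch_feats, text_normal, text_abnormal, grid, temperature=0.07):
    """Hypothetical text-patch alignment producing a map and an image score.

    patch_feats:   (n_patches, d) patch features from one encoder layer
    text_normal:   (d,) embedding of the "normal" prompt
    text_abnormal: (d,) embedding of the "abnormal" prompt
    grid:          (rows, cols) with rows * cols == n_patches
    """
    patches = l2_normalize(patch_feats)
    texts = l2_normalize(np.stack([text_normal, text_abnormal]))  # (2, d)

    # Cosine similarity of every patch to both prompts, then a softmax
    # so each patch gets a probability of being abnormal.
    logits = patches @ texts.T / temperature                      # (n_patches, 2)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    anomaly_map = probs[:, 1].reshape(grid)      # pixel-level (patch-level) map
    image_score = float(anomaly_map.max())       # assumed image-level aggregation
    return anomaly_map, image_score
```

In a multi-layer setting such as the one the caption mentions, per-layer maps could be averaged before aggregation; that step is omitted here to keep the example short.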