Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation

Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab

TL;DR

This work tackles zero-shot multi-type anomaly detection and segmentation in manufacturing under distribution shift by integrating defect-aware prompts with progressive tuning. The core idea, DAPO, combines fixed textual anchors with shared learnable defect tokens to align image regions with defect semantics, while progressively updating both text and image prompts. The method achieves strong performance on multiple industrial datasets, particularly for unseen defect types, and demonstrates robustness through comprehensive ablations and visual analyses. Practically, DAPO reduces manual prompt engineering and improves interpretability by linking specific defect types to learned textual representations, enabling scalable defect localization and characterization in real-world settings.

Abstract

Recent vision-language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details such as the specific anomaly type (e.g., "hole", "cut", or "scratch"), which could provide more specific insight into the nature of an anomaly. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; and 2) enables manufacturers to understand the root causes of an anomaly and quickly implement more targeted and appropriate corrective measures. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that, compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
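
The hybrid prompt idea described in the abstract (fixed textual anchors combined with shared learnable token embeddings) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_embed`, `encode`, the embedding width, the number of shared tokens, the mean-pooling "encoder", and the temperature of 0.07 are all assumptions made for illustration, and a real system would use a pre-trained VLM's text and image encoders and train the shared tokens by gradient descent.

```python
import math
import random

EMBED_DIM = 8  # toy width, chosen only for this illustration


def toy_embed(word, dim=EMBED_DIM):
    """Hypothetical frozen embedding lookup: a fixed pseudo-random vector per word."""
    rng = random.Random(word)  # string seed gives the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def build_hybrid_prompt(shared_tokens, anchor_word):
    """Hybrid prompt = shared learnable tokens followed by one fixed textual anchor."""
    return list(shared_tokens) + [toy_embed(anchor_word)]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def encode(prompt):
    """Stand-in for a text encoder (assumption): mean-pool tokens, then normalize."""
    dim = len(prompt[0])
    pooled = [sum(tok[i] for tok in prompt) / len(prompt) for i in range(dim)]
    return l2_normalize(pooled)


def classify_patch(patch_feature, class_embeddings, temperature=0.07):
    """Softmax over cosine similarities between one patch and every class prompt."""
    patch = l2_normalize(patch_feature)
    logits = {
        name: sum(p * t for p, t in zip(patch, emb)) / temperature
        for name, emb in class_embeddings.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


anchors = ["normal", "hole", "cut", "scratch"]
init_rng = random.Random(0)
# Shared tokens: in training these would receive gradients; here they are only initialized.
shared_tokens = [[init_rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(4)]
class_embeddings = {a: encode(build_hybrid_prompt(shared_tokens, a)) for a in anchors}

patch_feature = toy_embed("scratch")  # pretend image-patch feature
probabilities = classify_patch(patch_feature, class_embeddings)
print(max(probabilities, key=probabilities.get))
```

The point of the sketch is the structure: every class prompt reuses the same learnable tokens, and only the fixed anchor word differs, so adding a new defect type requires a new anchor rather than a new handcrafted prompt.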

Paper Structure

This paper contains 35 sections, 11 equations, 19 figures, 26 tables.

Figures (19)

  • Figure 1: Existing methods (top) produce embeddings with insufficient fine-grained discriminability, suppressing cues needed to separate anomalies from normal patterns. DAPO (bottom) improves embedding representations through defect-aware progressive learning, refining them into fine-grained clusters across defect types without significant manual prompt engineering.
  • Figure 2: Overview of the proposed Defect-aware Prompt Optimization (DAPO) architecture. To enable zero-shot multi-type anomaly detection and segmentation, we progressively optimize a set of shared learnable prompts to align image features with defect-aware semantic representations. DAPO adapts pre-trained vision-language models to industrial scenarios by conditioning prompts on both global and local visual features. The shared prompt tokens are updated through a global learning strategy and reused across image patches, enabling the model to distinguish between normal and multiple defect types under distribution shifts.
  • Figure 3: Anomaly segmentation results from DAPO on VisA, MVTec-AD, and the internal dataset. Top: unseen multi-defect cases (PCB, Capsules). Bottom: illustration of the contamination defect in the internal semiconductor dataset.
  • Figure 4: Binary anomaly detection performance comparison. (a) ROC curves for DAPO, AnomalyCLIP, and a random classifier. (b) and (c) Confusion matrices showing classification accuracy; DAPO achieves better overall performance than AnomalyCLIP.
  • Figure 5: Hyperparameter analysis. (a) Weight balance $\lambda$; (b) length $l$. Pixel-level (top) and image-level (bottom) AUROC and AUPRO/AP are shown on the left and right of each subplot.
  • ...and 14 more figures
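
For readers unfamiliar with the image-level metrics named in the abstract and in Figures 4 and 5, the following minimal sketch shows how AUROC and average precision (AP) can be computed from anomaly scores. The labels and scores are made-up toy values, not results from the paper, and the functions are plain reference implementations (AP here does not treat tied scores specially); an evaluation pipeline would typically rely on a library routine instead.

```python
def auroc(labels, scores):
    """Probability that a random anomalous sample outscores a random normal one.

    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each anomalous sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy data: 1 = anomalous image, 0 = normal image; higher score = more anomalous.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

print(round(auroc(labels, scores), 4))              # 5/9  -> 0.5556
print(round(average_precision(labels, scores), 4))  # 13/18 -> 0.7222
```

On this toy data, 5 of the 9 anomalous/normal pairs are ranked correctly, giving an AUROC of 5/9, and the precisions at the three anomalous ranks are 1, 2/3, and 1/2, giving an AP of 13/18.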