GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

Donghyeong Kim, Chaewon Park, Suhwan Cho, Hyeonjeong Lim, Minseok Kang, Jungho Lee, Sangyoun Lee

TL;DR

GenCLIP tackles zero-shot anomaly detection by learning stable, general text prompts augmented with multi-layer CLIP visual features, coupled with dual-branch inference that balances generalization and category-specific cues. The framework introduces General Query Prompt tokens (GQPs), Multi-layer Vision Prompt tokens (MVPs), and Class Name Filtering (CNF) to enhance vision-language alignment across unseen object classes, and it is trained with focal and Dice losses on layer-wise anomaly maps. Inference combines a vision-enhanced branch and a query-only branch to produce pixel-level segmentation and image-level detection, achieving state-of-the-art performance on six industrial datasets. This approach offers strong generalization and precise localization, enabling practical zero-shot anomaly detection in varied manufacturing contexts.
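The training objective mentioned above can be illustrated with a minimal sketch of binary focal and Dice losses applied to layer-wise anomaly maps. This is not the authors' implementation: the function names, the `gamma=2.0` default, the equal weighting of the two losses, and the plain sum over layers are all assumptions made for illustration. Maps are represented as flattened lists of per-pixel anomaly probabilities.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over flattened pixels (gamma is an assumed default)."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Probability the map assigns to the true class of this pixel.
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0 - eps)  # clamp to keep log() finite
        # Well-classified pixels (p_t near 1) are down-weighted by (1 - p_t)^gamma.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def layerwise_loss(layer_maps, mask, gamma=2.0):
    """Assumed aggregation: unweighted sum of focal + Dice over all layer maps."""
    return sum(focal_loss(m, mask, gamma) + dice_loss(m, mask) for m in layer_maps)


# Hypothetical 4-pixel example: two layers predicting the same ground-truth mask.
mask = [0, 0, 1, 1]
layer_maps = [[0.1, 0.2, 0.8, 0.9], [0.3, 0.1, 0.6, 0.7]]
total = layerwise_loss(layer_maps, mask)
```

Focal loss addresses the heavy imbalance between normal and anomalous pixels, while Dice loss rewards overlap with the anomalous region directly, which is a common reason the two are paired in segmentation settings.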

Abstract

Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.
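The dual-branch idea can be sketched as follows. The abstract does not state how the two branch outputs are combined, so the weighted average with a hypothetical weight `alpha`, and the use of the maximum pixel value as the image-level score, are assumptions chosen only to make the idea concrete. The function names are likewise illustrative.

```python
def fuse_branches(vision_map, query_map, alpha=0.5):
    """Blend two pixel-level anomaly maps (2D lists of equal shape).

    alpha weights the vision-enhanced branch; (1 - alpha) weights the
    query-only branch. The weighted average is an assumed fusion rule.
    """
    if len(vision_map) != len(query_map):
        raise ValueError("branch maps must have the same number of rows")
    fused = []
    for v_row, q_row in zip(vision_map, query_map):
        if len(v_row) != len(q_row):
            raise ValueError("branch maps must have the same number of columns")
        fused.append([alpha * v + (1.0 - alpha) * q for v, q in zip(v_row, q_row)])
    return fused


def image_score(pixel_map):
    """Assumed image-level score: the largest pixel-level anomaly value."""
    return max(max(row) for row in pixel_map)


# Hypothetical 2x2 outputs from the two branches.
vision_enhanced = [[0.2, 0.8], [0.1, 0.3]]
query_only = [[0.4, 0.6], [0.1, 0.5]]
fused = fuse_branches(vision_enhanced, query_only)
score = image_score(fused)
```

Averaging lets the query-only branch temper category-specific errors from the vision-enhanced branch, which matches the stated goal of complementary outputs, although the actual fusion rule may differ.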

Paper Structure

This paper contains 30 sections, 16 equations, 22 figures, 15 tables.

Figures (22)

  • Figure 1: Paradigms of CLIP prompt learning-based zero-shot anomaly detection. (a) Methods that make generalizable query text prompts. (b) Methods that adapt high-level CLIP vision features to facilitate text prompts. (c) Our approach, which leverages multi-layer features from the CLIP visual encoder to augment text embeddings.
  • Figure 2: The framework of GenCLIP. It consists of a CLIP vision encoder and a text encoder. The normal and abnormal prompts include individually learnable parameters $\mathbf{N_p}, \mathbf{A_p}$ and shared learnable parameters $\mathbf{Q_P}$. During training, all modules except for CNF and $\mathbf{F_Q}$ are used. Given an image $\mathbf{I}$ and texts $\mathbf{T_N}$ and $\mathbf{T_A}$, GenCLIP outputs layer-wise score maps $\mathbf{S^i_V}$ by computing the similarity between the vision and text features. During inference, GenCLIP uses a two-branch strategy: a vision-enhanced branch and a query-only branch, shown at the bottom of the figure. CNF is used only in the vision-enhanced branch.
  • Figure 3: (a) The architecture of CNF. CNF utilizes the frozen CLIP image encoder and text encoder to replace ambiguous class names with the generic term "object," eliminating unnecessary information and preventing potential confusion in the text encoder. (b) Text prompts after CNF. Class-aware text prompts $\mathbf{T^i_V}$ and the general text prompt $\mathbf{T_Q}$ are input to GenCLIP. $\mathbf{T_Q}$ is a unified prompt regardless of class.
  • Figure 4: Qualitative comparison against AdaCLIP and AnomalyCLIP on representative datasets.
  • Figure 5: t-SNE visualization of text features $\mathbf{F_T^{\mathit{i}}}$ and $\mathbf{F_Q}$ for (a) MVTec and (b) VisA test datasets.
  • ...and 17 more figures
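
The Figure 2 caption describes score maps computed from the similarity between vision and text features. A minimal sketch of that step for a single layer is given below, assuming cosine similarity followed by a two-way softmax over the normal and abnormal text features. The `temperature=0.07` value, the function names, and the toy 2-D features are assumptions, not values taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_score_map(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Per-patch anomaly probability for one layer.

    Each patch feature is compared with the normal and abnormal text
    features; a softmax over the two scaled similarities gives the
    probability that the patch is anomalous.
    """
    scores = []
    for feat in patch_feats:
        s_normal = cosine(feat, normal_text) / temperature
        s_abnormal = cosine(feat, abnormal_text) / temperature
        peak = max(s_normal, s_abnormal)  # subtract the max for numerical stability
        e_normal = math.exp(s_normal - peak)
        e_abnormal = math.exp(s_abnormal - peak)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores


# Toy features: the first patch aligns with the normal text feature,
# the second with the abnormal one, and the third sits between them.
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_scores = anomaly_score_map(patches, normal_text, abnormal_text)
```

Reshaping `layer_scores` to the patch grid and upsampling to the image size would give one layer-wise map; repeating this per layer yields the set of maps the caption refers to.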