KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration

Chengyuan Li, Suyang Zhou, Jieping Kong, Lei Qi, Hui Xue

TL;DR

KAnoCLIP tackles zero-shot anomaly detection (ZSAD) by addressing two limitations of CLIP: reliance on manually crafted anomaly prompts and weak pixel-level segmentation. It introduces Knowledge-Driven Prompt Learning (KnPL), which builds an LLM-VQA knowledge base to generate learnable normal and abnormal prompts and guides their learning with a knowledge-driven (KD) loss, removing the need for fixed prompts and improving generalization. The framework also preserves local visual semantics with CLIP-VV and strengthens cross-modal integration via Bi-CMCI and Conv-Adapter. All components are optimized under a joint objective $\mathcal{L}_{total} = \alpha \mathcal{L}_{KD} + \beta \mathcal{L}_{global} + \gamma \mathcal{L}_{local}$ with $\alpha=\beta=\gamma=1$. Experiments across 12 industrial and medical datasets show state-of-the-art ZSAD performance, with improvements in both image-level and pixel-level AUC, which makes the approach relevant to privacy-constrained or data-scarce settings. KAnoCLIP thus provides a knowledge-guided approach to ZSAD that adapts to unseen anomaly classes while localizing anomalies at the pixel level.
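The joint objective above is a plain weighted sum, which can be sketched in a few lines. This is only an illustration: the function name `total_loss` and the scalar inputs are hypothetical, and how each individual term ($\mathcal{L}_{KD}$, $\mathcal{L}_{global}$, $\mathcal{L}_{local}$) is computed is not described in this summary.

```python
def total_loss(l_kd, l_global, l_local, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three loss terms.

    The default weights follow the alpha = beta = gamma = 1 setting
    stated in the TL;DR. The inputs are assumed to be already-computed
    scalar loss values (hypothetical placeholders here).
    """
    return alpha * l_kd + beta * l_global + gamma * l_local


# Placeholder loss values, chosen only to show the arithmetic.
example = total_loss(0.5, 0.25, 0.125)
```

With all weights equal to 1, the three terms contribute equally, so the total is simply their sum.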

Abstract

Zero-shot anomaly detection (ZSAD) identifies anomalies without needing training samples from the target dataset, which is essential for scenarios with privacy concerns or limited data. Vision-language models like CLIP show potential in ZSAD but have two limitations: relying on manually crafted fixed textual descriptions or anomaly prompts is time-consuming and prone to semantic ambiguity, and CLIP struggles with pixel-level anomaly segmentation because it focuses more on global semantics than on local details. To address these limitations, we introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models. KAnoCLIP combines general knowledge from a Large Language Model (GPT-3.5) and fine-grained, image-specific knowledge from a Visual Question Answering system (Llama3) via Knowledge-Driven Prompt Learning (KnPL). KnPL uses a knowledge-driven (KD) loss function to create learnable anomaly prompts, removing the need for fixed text prompts and enhancing generalization. KAnoCLIP includes the CLIP visual encoder with V-V attention (CLIP-VV), Bi-Directional Cross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, enhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art performance in ZSAD across 12 industrial and medical datasets, demonstrating superior generalization compared to existing methods.
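The abstract names V-V attention without giving its formula. A common formulation in prior CLIP adaptation work replaces the query-key similarity of standard self-attention with value-value similarity, so that each token attends to tokens with similar value features. The NumPy sketch below assumes that formulation; it is not taken from the paper, and the function names and the single-head, projection-free setup are simplifications made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def qk_attention(q, k, v):
    # Standard scaled dot-product attention: weights from query-key similarity.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def vv_attention(v):
    # Assumed V-V variant: weights come from value-value similarity instead.
    d = v.shape[-1]
    return softmax(v @ v.T / np.sqrt(d)) @ v


# Toy input: 5 patch tokens with 8-dimensional value features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
out = vv_attention(tokens)  # same shape as the input, (5, 8)
```

Under this assumption, V-V attention is standard attention with the queries and keys both set to the values, which is why each token's strongest attention weight tends to fall on itself and on locally similar patches.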

Paper Structure (16 sections, 15 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of KAnoCLIP, which consists of four key components: KnPL, CLIP-VV, Bi-CMCI, and Conv-Adapter. KnPL uses an LLM and a VQA system to form an LLM-VQA knowledge base that guides the generation of learnable normal and abnormal prompts (LNPs and LAPs), reducing overfitting and enhancing generalization. The CLIP-VV visual encoder captures local visual details with V-V attention, while Conv-Adapter and Bi-CMCI provide comprehensive cross-modal fusion of global and local features. The red dashed line represents the $\mathcal{L}_{\text{KD}}$ loss introduced by KnPL, which guides the learning of LNPs and LAPs during training.
  • Figure 2: Constructing the LLM-VQA Knowledge Base: Generating Potential Anomalies and Image-Specific Descriptions.
  • Figure 3: BiCA.
  • Figure 4: Conv-Adapter.