KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration
Chengyuan Li, Suyang Zhou, Jieping Kong, Lei Qi, Hui Xue
TL;DR
KAnoCLIP tackles zero-shot anomaly detection (ZSAD) by addressing two CLIP-related limitations: reliance on manually crafted anomaly prompts and weak pixel-level segmentation. It introduces Knowledge-Driven Prompt Learning (KnPL), which builds an LLM-VQA knowledge base to generate learnable abnormal/normal prompts and guides their learning with a knowledge-driven (KD) loss, removing the need for fixed prompts and improving generalization. The framework also preserves local visual semantics with CLIP-VV and strengthens cross-modal integration via Bi-CMCI and Conv-Adapter. All components are optimized under a joint objective $\mathcal{L}_{total} = \alpha \mathcal{L}_{KD} + \beta \mathcal{L}_{global} + \gamma \mathcal{L}_{local}$ with $\alpha=\beta=\gamma=1$. Experiments across 12 industrial and medical datasets show state-of-the-art ZSAD performance, with improvements in both image-level and pixel-level AUC, indicating strong generalization in privacy-constrained or data-scarce settings. KAnoCLIP thus provides a knowledge-guided approach to zero-shot anomaly detection that adapts to unseen anomaly classes while delivering precise localization.
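The joint objective above is a plain weighted sum, which can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `total_loss` and the scalar inputs are assumptions, and in practice each term would be a tensor produced by the training framework rather than a Python float.

```python
def total_loss(l_kd, l_global, l_local, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the KD, global, and local loss terms.

    The default weights follow the stated setting alpha = beta = gamma = 1.
    The function name and scalar arguments are hypothetical.
    """
    return alpha * l_kd + beta * l_global + gamma * l_local


# Placeholder loss values for illustration only; not results from the paper.
example = total_loss(0.5, 0.25, 0.125)
print(example)  # 0.875
```

With all three weights equal to 1, the objective reduces to the unweighted sum of the terms, so no term is prioritized over the others.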
Abstract
Zero-shot anomaly detection (ZSAD) identifies anomalies without requiring training samples from the target dataset, which is essential in scenarios with privacy concerns or limited data. Vision-language models such as CLIP show potential for ZSAD but have two limitations. First, they rely on manually crafted fixed textual descriptions or anomaly prompts, which are time-consuming to write and prone to semantic ambiguity. Second, CLIP struggles with pixel-level anomaly segmentation because it focuses on global semantics rather than local details. To address these limitations, we introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models. KAnoCLIP combines general knowledge from a Large Language Model (GPT-3.5) and fine-grained, image-specific knowledge from a Visual Question Answering system (Llama3) via Knowledge-Driven Prompt Learning (KnPL). KnPL uses a knowledge-driven (KD) loss function to create learnable anomaly prompts, removing the need for fixed text prompts and enhancing generalization. KAnoCLIP also includes the CLIP visual encoder with V-V attention (CLIP-VV), Bi-Directional Cross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, enhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art ZSAD performance across 12 industrial and medical datasets, demonstrating superior generalization compared to existing methods.
