AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP

Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S. Kevin Zhou

TL;DR

This work tackles zero-shot anomaly detection by addressing CLIP's Anomaly Unawareness in the text space. It introduces AA-CLIP, a two-stage adaptation using Residual Adapters to create anomaly-aware text anchors and to align patch-level visuals to these anchors, all while keeping the CLIP backbone frozen to preserve generalization. A Disentanglement Loss enforces independence between normal and anomaly anchors, enabling robust generalization to unseen classes, and multi-granularity patch features are used for precise localization. Empirically, AA-CLIP achieves state-of-the-art results on industrial and medical AD benchmarks with limited data (e.g., 2-shot) and remains competitive or superior with larger data, demonstrating efficient, scalable anomaly detection and localization.

Abstract

Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at https://github.com/Mwxinnn/AA-CLIP.
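The abstract names two ingredients without giving their formulas here: residual adapters that adapt CLIP "in a controlled manner", and (per the TL;DR) a Disentanglement Loss that keeps normal and anomaly anchors independent. The following is a minimal sketch of how such components are commonly built, not the authors' implementation. The blending form `(1 - lam) * x + lam * W x`, the ratio `lam`, and the squared-cosine penalty are all assumptions made for illustration; the repository linked above is the authoritative source.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def residual_adapter(x, weight, lam=0.1):
    """Blend a frozen feature x with a small learned update.

    Assumed form: out = (1 - lam) * x + lam * (W @ x). `lam` is a
    hypothetical blending ratio; a small value keeps the output close
    to the frozen backbone feature.
    """
    adapted = [dot(row, x) for row in weight]
    return [(1.0 - lam) * xi + lam * ai for xi, ai in zip(x, adapted)]


def disentangle_penalty(normal_anchor, anomaly_anchor):
    """Assumed penalty: squared cosine similarity between the two anchors.

    It is zero when the anchors are orthogonal and one when they are
    parallel, so minimizing it pushes the anchors apart.
    """
    cos = dot(normalize(normal_anchor), normalize(anomaly_anchor))
    return cos ** 2


# Toy 3-d features; real CLIP text features have hundreds of dimensions.
frozen = [1.0, 0.0, 0.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With an identity adapter weight the feature passes through unchanged.
unchanged = residual_adapter(frozen, identity, lam=0.1)

# Parallel anchors are maximally penalized; orthogonal anchors are not.
aligned = disentangle_penalty([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
orthogonal = disentangle_penalty([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

In this sketch only `weight` would be trained while the backbone feature `x` stays frozen, which is one way to read the "controlled" adaptation described above.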

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (Top) Examples illustrating CLIP's Anomaly Unawareness. Despite the obvious anomalies in the images, the image features are mistakenly more similar to normal descriptions than to anomaly descriptions. A low temperature $\tau$ amplifies this problem. (Bottom) Text Feature Similarity Heatmap among Normal and Anomaly Descriptions: Original CLIP vs. After Text Adaptation. Red indicates high similarity. In original CLIP, normal features are strongly similar to anomaly features, whereas text adaptation separates them, clarifying the semantic distinction between normal and anomaly descriptions.
  • Figure 2: t-SNE Visualization of Text Features from Original CLIP vs. AA-CLIP. Each point represents a text feature encoded from a prompt. Original CLIP's normal and anomaly text features are intertwined, while our method effectively disentangles them. This disentanglement is generalizable to novel classes, validating the anomaly-awareness of our model.
  • Figure 3: The Two-Stage Training Pipeline of Anomaly-Aware CLIP. In the first stage, the text encoder of AA-CLIP is trained to identify anomaly-related semantics, aided by a Disentanglement Loss. In the second stage, patch features are aligned with these text anchors. Both stages rely on Residual Adapters integrated into the shallow layers of CLIP's backbone. This controlled adaptation enables CLIP to distinguish anomalies effectively, yielding our Anomaly-Aware CLIP.
  • Figure 4: Average Results (Top) and Results on BTAD (Bottom) of Different Methods Trained on 2-, 16-, and 64-shot per Class and on the Full Data of VisA. Our method shows high fitting efficiency, achieving strong results across all data scales.
  • Figure 5: Visualization of Anomaly Localization Results of Original CLIP radford2021learning, AnomalyCLIP zhou2023anomalyclip, VAND chen2023april, and our AA-CLIP. Compared to previous methods, AA-CLIP localizes anomalies more reliably.
  • ...and 1 more figure
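
The Figure 1 caption notes that a low temperature $\tau$ amplifies CLIP's tendency to score anomalous images as normal, and Figure 3 describes aligning patch features with the two text anchors. A toy sketch of both ideas follows, assuming the usual CLIP-style scoring (a softmax over cosine similarities divided by $\tau$). The function names, the 2-d vectors, and the value `tau=0.01` are illustrative assumptions, not values taken from the paper.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def anomaly_probability(feature, normal_anchor, anomaly_anchor, tau=0.01):
    """Two-way softmax over cosine similarities scaled by temperature tau."""
    s_normal = cosine(feature, normal_anchor) / tau
    s_anomaly = cosine(feature, anomaly_anchor) / tau
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(s_normal, s_anomaly)
    e_normal = math.exp(s_normal - m)
    e_anomaly = math.exp(s_anomaly - m)
    return e_anomaly / (e_normal + e_anomaly)


def anomaly_map(patch_features, normal_anchor, anomaly_anchor, tau=0.01):
    """Score every patch in a 2-d grid of patch features."""
    return [
        [anomaly_probability(p, normal_anchor, anomaly_anchor, tau) for p in row]
        for row in patch_features
    ]


# Toy anchors (hypothetical values, chosen only to make the effect visible).
normal_anchor = [1.0, 0.0]
anomaly_anchor = [0.0, 1.0]

# A feature that is only slightly closer to the normal anchor.
feature = [0.6, 0.5]
p_high_tau = anomaly_probability(feature, normal_anchor, anomaly_anchor, tau=1.0)
p_low_tau = anomaly_probability(feature, normal_anchor, anomaly_anchor, tau=0.01)

# A 2x2 grid of patches in which one patch matches the anomaly anchor.
patches = [[[1.0, 0.0], [1.0, 0.0]],
           [[1.0, 0.0], [0.0, 1.0]]]
heatmap = anomaly_map(patches, normal_anchor, anomaly_anchor)
```

With `tau=1.0` the small similarity gap leaves the anomaly probability just under one half, while `tau=0.01` drives it close to zero. This is the sense in which a low temperature turns a slight preference for the normal description into a confident wrong prediction.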