Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection

Zhaoxiang Zhang, Hanqiu Deng, Jinan Bao, Xingyu Li

TL;DR

The paper tackles zero-shot anomaly detection by combining visual references with language guidance in a CLIP framework. A dual-image enhancement builds a joint vision-language scoring system in which each image of a pair serves as a visual reference for the other, complemented by a test-time adaptation (TTA) module with pseudo-anomaly synthesis. Key contributions include dual-image feature pairing, a V-V attention-based localization enhancement, and a training-free TTA mechanism that refines alignment. Experiments on MVTecAD and VisA show performance competitive with state-of-the-art (SOTA) methods in both anomaly classification and localization, highlighting practical gains in open-world anomaly detection.

Abstract

Image anomaly detection has long been a challenging task in computer vision. The advent of vision-language models, particularly CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have explored the use of CLIP by aligning images with normal and abnormal prompt descriptions. However, exclusive dependence on textual guidance often falls short, highlighting the importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach that leverages a joint vision-language scoring system. Our method processes pairs of images, using each as a visual reference for the other, thereby enriching inference with visual context. This dual-image strategy markedly enhances both anomaly classification and localization performance. Furthermore, we strengthen our model with a test-time adaptation module that incorporates synthesized anomalies to refine localization. Our approach exploits the potential of joint vision-language anomaly detection and demonstrates performance comparable to current SOTA methods across various datasets.
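
To make the idea of joint vision-language scoring concrete, the following is a minimal sketch, not the authors' implementation. It assumes patch embeddings and text embeddings have already been extracted by a CLIP-like encoder, and every name in it (`text_anomaly_score`, `visual_anomaly_score`, `dual_image_scores`, the `alpha` weight, the temperature value) is hypothetical. The text branch scores each patch against a normal and an abnormal prompt embedding; the visual branch scores each patch by how poorly it matches any patch of the paired image; the two are blended with a simple weighted average, which is an assumption rather than the paper's stated fusion rule.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_anomaly_score(patches, text_normal, text_abnormal, temperature=0.07):
    """Per-patch softmax probability of the 'abnormal' prompt (assumed text branch)."""
    p = l2_normalize(patches)                                   # (n, d)
    t = l2_normalize(np.stack([text_normal, text_abnormal]))    # (2, d)
    logits = p @ t.T / temperature                              # (n, 2)
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]


def visual_anomaly_score(patches, reference_patches):
    """Per-patch distance to the best-matching reference patch, scaled to [0, 1]."""
    sim = l2_normalize(patches) @ l2_normalize(reference_patches).T  # (n, m)
    return 0.5 * (1.0 - sim.max(axis=1))


def dual_image_scores(patches_a, patches_b, text_normal, text_abnormal, alpha=0.5):
    """Score two images jointly, each one acting as the visual reference for the other.

    `alpha` weights the text branch against the visual branch (hypothetical choice).
    """
    joint_a = (alpha * text_anomaly_score(patches_a, text_normal, text_abnormal)
               + (1.0 - alpha) * visual_anomaly_score(patches_a, patches_b))
    joint_b = (alpha * text_anomaly_score(patches_b, text_normal, text_abnormal)
               + (1.0 - alpha) * visual_anomaly_score(patches_b, patches_a))
    return joint_a, joint_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=(9, 16)), rng.normal(size=(9, 16))
    t_norm, t_abn = rng.normal(size=16), rng.normal(size=16)
    score_a, score_b = dual_image_scores(a, b, t_norm, t_abn)
    # An image-level score could be taken as the maximum patch score.
    print(score_a.max(), score_b.max())
```

In this sketch the per-patch scores double as a coarse localization map, and taking their maximum gives one possible image-level classification score.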

Paper Structure (21 sections, 10 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the Dual-Image Enhanced CLIP Anomaly Detection Model. Traditional approaches often depend on a single modality for anomaly detection, where (A) demonstrates the use of image embeddings, and (B) illustrates reliance on text prompts. Our proposed method, shown in (C), integrates both visual and textual information, utilizing a dual-image input to enrich the feature space for a more robust and comprehensive anomaly detection framework.
  • Figure 2: Overview of our framework for Dual-Image Enhanced CLIP. The left part shows feature extraction by the vision and text encoders, and the right part shows the inference process. The snowflake icon denotes frozen modules, and the flame icon denotes trainable modules.
  • Figure 3: Qualitative comparison of anomaly detection (AD) results on MVTecAD and VisA. The top row shows results obtained using textual information only. The middle row shows detection results obtained by comparing the visual features of paired queries. The bottom row shows the more robust results achieved by integrating both language and visual features. The ground truth is marked with green boundaries.
  • Figure 4: Workflow of the test-time adaptation module. The module passes patch tokens through a linear layer and aligns predictions on the adapted tokens with the zero-shot vision-language joint anomaly score. Pseudo-anomalous samples are compared with the original samples to predict pseudo-anomaly masks. The flame icon denotes trainable components. $A^{T_M}$ denotes the prediction for the pseudo anomalies.
  • Figure 5: Ablation studies on the influence of reference images.
  • ...and 1 more figure
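
The alignment step described in the Figure 4 caption can be illustrated with a rough sketch, under stated assumptions and not as the paper's actual procedure. Here a single linear layer `W` on the patch tokens is fitted by plain gradient descent so that a per-patch prediction matches a given zero-shot score map. The prediction head (a sigmoid of the projection onto a fixed `direction` vector), the mean-squared-error loss, the identity initialisation, and the function name `adapt_linear_layer` are all hypothetical choices; the pseudo-anomaly branch is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def adapt_linear_layer(tokens, direction, target, steps=200, lr=0.5):
    """Fit a linear layer W so that sigmoid((tokens @ W) @ direction) matches `target`.

    tokens:    (n, d) patch tokens, assumed unit-normalised
    direction: (d,)   fixed scoring direction (hypothetical prediction head)
    target:    (n,)   zero-shot anomaly score per patch, in [0, 1]
    Returns the adapted W and the loss recorded before each update.
    """
    n, d = tokens.shape
    W = np.eye(d)                           # identity: adapted tokens start as the originals
    losses = []
    for _ in range(steps):
        z = tokens @ W @ direction          # (n,) pre-activation per patch
        p = sigmoid(z)                      # (n,) predicted anomaly score
        err = p - target
        losses.append(float(np.mean(err ** 2)))
        g = (2.0 / n) * err * p * (1.0 - p)             # dLoss/dz for each patch
        grad_W = np.outer(tokens.T @ g, direction)      # dLoss/dW, shape (d, d)
        W -= lr * grad_W
    return W, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 8))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    t = rng.normal(size=8)
    t /= np.linalg.norm(t)
    y = rng.uniform(size=32)                # stand-in for a zero-shot score map
    W, losses = adapt_linear_layer(x, t, y)
    print(losses[0], losses[-1])
```

Only `W` is updated in this sketch, mirroring the caption's distinction between the trainable linear layer and the otherwise fixed inputs.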