CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP

Zuo Zuo, Jiahao Dong, Yao Wu, Yanyun Qu, Zongze Wu

TL;DR

This work tackles industrial anomaly classification under severe data scarcity by retooling CLIP for few-shot detection. It introduces CLIP-FSAC++, which adds image and text adapters and a novel Anomaly Descriptor with cross-modality attention to align visual and textual representations in a one-stage training setup. Synthetic anomalies are generated to enable effective contrastive learning, and a compositional text-prompt ensemble further improves cross-modal matching. Empirical results on VisA and MVTEC-AD show state-of-the-art performance in 1-, 2-, 4-, and 8-shot settings, with strong robustness across ablations and analyses, highlighting the practical impact of cross-modal fusion in industrial anomaly tasks.

Abstract

Industrial anomaly classification (AC) is an indispensable task in industrial manufacturing that guarantees the quality and safety of various products. To address the scarcity of data in industrial scenarios, many few-shot anomaly detection methods have emerged recently. In this paper, we propose an effective few-shot anomaly classification (FSAC) framework with one-stage training, dubbed CLIP-FSAC++. Specifically, we introduce a cross-modality interaction module named Anomaly Descriptor following the image and text encoders, which enhances the correlation between visual and text embeddings and adapts the representations of CLIP from pre-trained data to target data. In the Anomaly Descriptor, an image-to-text cross-attention module is used to obtain image-specific text embeddings, and a text-to-image cross-attention module is used to obtain text-specific visual embeddings. These modality-specific embeddings are then used to enhance the original representations of CLIP for better matching ability. Comprehensive experimental results are provided to evaluate our method in few-normal-shot anomaly classification on VisA and MVTEC-AD in 1-, 2-, 4-, and 8-shot settings. The source code is available at https://github.com/Jay-zzcoder/clip-fsac-pp
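The cross-modality interaction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of which modality acts as the attention query, the absence of learned projections, the residual addition used to "enhance" the original embeddings, the mean pooling over patches, and the `temperature` value are all assumptions made for illustration, and every function name below is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    # Scale each embedding (last axis) to unit length.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned weights
    # (assumption: a real module would add linear projections).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def anomaly_descriptor(visual, text):
    # visual: (n_patches, d) visual embeddings.
    # text:   (2, d) text embeddings, row 0 = normal, row 1 = anomalous.
    # Assumption: text queries attend over visual tokens to produce
    # image-specific text embeddings, and vice versa.
    image_specific_text = cross_attention(text, visual)    # (2, d)
    text_specific_visual = cross_attention(visual, text)   # (n_patches, d)
    # Assumption: enhancement is a residual sum followed by normalization.
    enhanced_text = l2_normalize(text + image_specific_text)
    enhanced_visual = l2_normalize(visual + text_specific_visual)
    return enhanced_visual, enhanced_text

def anomaly_score(visual, text, temperature=0.07):
    # Match the pooled visual embedding against both text embeddings and
    # return the probability assigned to the anomalous prompt.
    enhanced_visual, enhanced_text = anomaly_descriptor(visual, text)
    image_embedding = l2_normalize(enhanced_visual.mean(axis=0))
    logits = image_embedding @ enhanced_text.T / temperature
    return softmax(logits)[1]

# Random stand-ins for CLIP features (49 patches, 64 dimensions).
rng = np.random.default_rng(0)
visual = l2_normalize(rng.normal(size=(49, 64)))
text = l2_normalize(rng.normal(size=(2, 64)))
score = anomaly_score(visual, text)
```

With real CLIP features, `visual` and `text` would come from the image and text encoders (and their adapters); the random arrays here only exercise the shapes.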

Paper Structure

This paper contains 33 sections, 18 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Different paradigms for few-shot anomaly detection. (a) Few-shot unsupervised AC with meta-learning. (b) Few-shot unsupervised AC based on a vision isometric invariant GNN with a memory bank. (c) Our proposed approach leverages the image-text alignment capability of a large vision-language model and fine-tunes it for few-shot AD without an extra memory bank or massive normal samples.
  • Figure 2: The framework of CLIP-FSAC++. CLIP-AC indicates zero-shot anomaly classification with original CLIP. $f$ and $g$ are image and text encoders of CLIP, $A_f$ and $A_g$ are image and text adapters.
  • Figure 3: Synthetic anomalies. (a) random perturbation. (b) NSA.
  • Figure 4: Architecture of anomaly descriptor.
  • Figure 5: Visualization of grad maps and ground truth. Yellow regions in GT denote anomalies.
  • ...and 7 more figures