VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, Guiguang Ding

TL;DR

VCP-CLIP addresses zero-shot anomaly segmentation by introducing visual context prompting to CLIP. The Pre-VCP module injects global image context into text prompts, while the Post-VCP module uses fine-grained image features to refine text embeddings via cross-modal attention, enabling robust anomaly localization on unseen products without product-specific prompts. Across 10 real-world industrial datasets, VCP-CLIP achieves state-of-the-art zero-shot performance, with strong improvements in average precision (AP) and prompts that generalize robustly at test time. The approach reduces reliance on handcrafted prompts and shows practical potential for privacy-preserving, data-efficient industrial defect inspection.

Abstract

Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in the zero-shot anomaly segmentation (ZSAS) task, using a unified model to directly detect anomalies on any unseen product with painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known and therefore set product-specific text prompts, which is difficult to achieve in data-privacy scenarios. Moreover, even products of the same type exhibit significant differences due to specific components and variations in the production process, posing significant challenges to the design of text prompts. To this end, we propose a visual context prompting model (VCP-CLIP) for the ZSAS task based on CLIP. The insight behind VCP-CLIP is to employ visual context prompting to activate CLIP's anomalous semantic perception ability. Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt, thus eliminating the necessity for product-specific prompts. Then, we propose a novel Post-VCP module that adjusts the text embeddings using the fine-grained features of the images. In extensive experiments conducted on 10 real-world industrial anomaly segmentation datasets, VCP-CLIP achieved state-of-the-art performance in the ZSAS task. The code is available at https://github.com/xiaozhen228/VCP-CLIP.
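
The two modules described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea under stated assumptions. The function names `pre_vcp` and `post_vcp`, the single visual token, its insertion position, the single attention head, the residual update, and all dimensions are hypothetical choices, and random vectors stand in for CLIP features. The text encoder that would turn the prompt into text embeddings is omitted.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    """Random matrix standing in for learned weights or CLIP features."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """(n x k) @ (k x m) -> (n x m) on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def pre_vcp(global_feat, prompt_tokens, proj, slot):
    """Assumed Pre-VCP idea: project the global image feature into the
    token space and insert it into the prompt where a product name
    would otherwise go."""
    visual_token = matmul([global_feat], proj)[0]
    return prompt_tokens[:slot] + [visual_token] + prompt_tokens[slot:]

def post_vcp(text_emb, patch_feats, w_q, w_k, w_v):
    """Assumed Post-VCP idea: single-head cross-attention in which text
    embeddings are queries and patch features are keys and values,
    followed by a residual update of the text embeddings."""
    q = matmul(text_emb, w_q)
    k = matmul(patch_feats, w_k)
    v = matmul(patch_feats, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (num_text x num_patches)
    attn = [softmax([s * scale for s in row]) for row in scores]
    context = matmul(attn, v)
    refined = [[t + c for t, c in zip(t_row, c_row)]
               for t_row, c_row in zip(text_emb, context)]
    return refined, attn

dim = 8                                  # toy embedding width
prompt_tokens = rand_matrix(4, dim)      # e.g. tokens of a generic prompt
global_feat = rand_matrix(1, dim)[0]     # stand-in for the global image feature
patch_feats = rand_matrix(6, dim)        # stand-in for 6 patch features
text_emb = rand_matrix(2, dim)           # stand-ins for "normal" / "abnormal"

prompt = pre_vcp(global_feat, prompt_tokens, rand_matrix(dim, dim), slot=2)
refined, attn = post_vcp(text_emb, patch_feats,
                         rand_matrix(dim, dim),
                         rand_matrix(dim, dim),
                         rand_matrix(dim, dim))
```

Here `prompt` holds the original four tokens plus one image-derived token, and each row of `attn` is a distribution over patches showing which image regions influenced the corresponding text embedding.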

Paper Structure (26 sections, 10 equations, 33 figures, 16 tables)

Figures (33)

  • Figure 1: A comparison between existing CLIP-based methods and VCP-CLIP. VCP-CLIP introduces a Pre-VCP module and a Post-VCP module, offering a distinct enhancement over existing CLIP-based methods. (a) Existing CLIP-based methods. (b) VCP-CLIP
  • Figure 2: Comparison of different text prompting methods. (a) Task setting. (b) Manually defined text prompting. (c) Designed unified text prompting. (d) Designed pre-visual context prompting.
  • Figure 3: Framework of VCP-CLIP. Our approach incorporates richer visual knowledge into the textual space and enables cross-modal interaction between textual and visual features through a Pre-VCP module and a Post-VCP module.
  • Figure 4: Visualization of the attention maps from the Post-VCP module.
  • Figure 5: Qualitative segmentation results. The first five columns use images from the MVTec-AD dataset, and the last five are from the VisA dataset.
  • ...and 28 more figures