ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

Shengze Li, Jianjian Cao, Peng Ye, Yuhan Ding, Chongjun Tu, Tao Chen

TL;DR

ClipSAM addresses zero-shot anomaly segmentation by combining CLIP's semantic localization with SAM's fine-grained segmentation. The key idea is to localize anomalies with CLIP via Unified Multi-scale Cross-modal Interaction (UMCI) and then refine the results with SAM using prompts generated by Multi-level Mask Refinement (MMR). The approach delivers state-of-the-art performance on industrial datasets such as MVTec-AD and VisA, and shows strong generalization on additional datasets like MTD and KSDD2. This two-stage collaboration reduces both mislocalization and post-processing complexity, enabling robust zero-shot anomaly segmentation in real-world scenarios.

Abstract

Recently, foundation models such as CLIP and SAM have shown promising performance on Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from notable drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose ClipSAM, a CLIP and SAM collaboration framework for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, and then to use that result as prompt constraints for SAM to refine the anomaly segmentation. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that lets language features interact with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses the positional information as multi-level prompts for SAM to acquire hierarchical levels of masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
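
To make the hand-off from CLIP to SAM more concrete, the following is a minimal sketch of how a rough anomaly map could be turned into point and box prompts. It is not the authors' implementation: the function name `prompts_from_rough_map`, the fixed threshold, and the top-k point selection are all assumptions chosen for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Point = Tuple[int, int]          # (x, y)


def prompts_from_rough_map(
    rough_map: List[List[float]],
    threshold: float = 0.5,
    num_points: int = 3,
) -> Tuple[List[Point], Box]:
    """Derive point and box prompts from a rough anomaly map.

    Hypothetical helper: the threshold and the top-k point selection
    are illustrative assumptions, not the procedure from the paper.
    """
    # Collect (score, x, y) for every pixel above the threshold.
    hits = [
        (score, x, y)
        for y, row in enumerate(rough_map)
        for x, score in enumerate(row)
        if score > threshold
    ]
    if not hits:
        raise ValueError("no pixel exceeds the threshold")

    # Box prompt: tightest axis-aligned box around the thresholded region.
    xs = [x for _, x, _ in hits]
    ys = [y for _, _, y in hits]
    box = (min(xs), min(ys), max(xs), max(ys))

    # Point prompts: the highest-scoring pixels inside the region.
    top = sorted(hits, key=lambda h: h[0], reverse=True)[:num_points]
    points = [(x, y) for _, x, y in top]
    return points, box
```

In this sketch the points and the box come from the same thresholded region, so they could be passed to a promptable segmenter together as one spatial constraint rather than as independent prompts.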

Paper Structure (26 sections, 15 equations, 17 figures, 5 tables, 2 algorithms)

Figures (17)

  • Figure 1: Structural comparisons among different approaches for Zero-Shot Anomaly Segmentation. Top: CLIP-based approaches. Middle: SAM-based approaches. Bottom: Our ClipSAM approach that leverages the strengths of both CLIP and SAM methods.
  • Figure 2: Overview of the proposed ClipSAM framework. ClipSAM includes two main processes: using CLIP for localization and rough segmentation, and using the resulting positional information to prompt SAM to refine the segmentation. These processes rely on two components: the Unified Multi-scale Cross-modal Interaction (UMCI) module and the Multi-level Mask Refinement (MMR) module. The UMCI module lets language features interact with visual features of different directions and scales, which helps CLIP locate and segment anomalous objects. The MMR module incorporates SAM: it uses point and box prompts extracted from the location information to guide SAM to output the desired masks, and fuses them with the rough segmentation result obtained by CLIP.
  • Figure 3: The results produced by SAM with different spatial prompts. As we can see, constraining SAM with the spatial prompt that represents points and boxes as a whole leads to better results.
  • Figure 4: Comparison of visualization results among ClipSAM, CLIP-based, and SAM-based methods on the MVTec-AD dataset. ClipSAM localizes anomalies and delineates their boundaries more accurately.
  • Figure 5: Visualization of the results of each step of our ClipSAM collaboration framework. ClipSAM first uses CLIP for rough segmentation and then uses SAM for refinement.
  • ...and 12 more figures
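
The captions for Figures 2 and 5 describe fusing the masks returned by SAM with the rough segmentation from CLIP. The text here does not give the fusion rule, so the sketch below is only one plausible reading: it averages the SAM masks and blends that average with the rough map using a single weight. The function name `fuse_masks` and the parameter `sam_weight` are hypothetical.

```python
from typing import List

Mask = List[List[float]]  # per-pixel scores in [0, 1]


def fuse_masks(
    rough_map: Mask,
    sam_masks: List[Mask],
    sam_weight: float = 0.5,
) -> Mask:
    """Blend SAM masks with a rough anomaly map.

    Assumed rule (not taken from the paper): average the SAM masks,
    then take a weighted sum of that average and the rough map.
    """
    if not sam_masks:
        # Nothing to refine with, so fall back to a copy of the rough map.
        return [row[:] for row in rough_map]

    fused: Mask = []
    for y, rough_row in enumerate(rough_map):
        fused_row = []
        for x, rough_score in enumerate(rough_row):
            # Mean response of all SAM masks at this pixel.
            sam_avg = sum(mask[y][x] for mask in sam_masks) / len(sam_masks)
            fused_row.append(
                sam_weight * sam_avg + (1.0 - sam_weight) * rough_score
            )
        fused.append(fused_row)
    return fused
```

With `sam_weight=0.5`, a pixel that every SAM mask marks as anomalous and that CLIP scores at 0.8 ends up near 0.9, while a pixel that no SAM mask covers is pulled toward zero.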