GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection

Jiyul Ham, Yonggon Jung, Jun-Geol Baek

TL;DR

GlocalCLIP addresses zero-shot anomaly detection under domain shifts and data scarcity by explicitly separating global and local prompts in a vision-language framework. It introduces object-agnostic glocal semantic prompts, deep-text prompt tuning, and a novel glocal contrastive learning objective to align global and local representations, aided by V-V attention for local detail. Across 15 real-world industrial and medical datasets, GlocalCLIP achieves state-of-the-art performance in both anomaly detection and localization and demonstrates strong cross-domain generalization. The approach offers a practical, scalable solution for robust visual anomaly detection without target-domain training data, with potential to bridge image and text modalities in real-world inspection tasks.

Abstract

Zero-shot anomaly detection (ZSAD) is crucial for detecting anomalous patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this scenario, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables the object-agnostic glocal semantic prompt to effectively capture general normal and anomalous patterns without dependency on specific objects in the image. We refine the text prompts for more precise adjustments by utilizing deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting anomalous patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains, achieving superior performance compared to existing methods. Code will be made available at https://github.com/YUL-git/GlocalCLIP.
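To make the global/local split concrete, the sketch below shows one way an image-level anomaly score and a patch-level anomaly map could be derived from two separate prompt pairs. It is an illustrative assumption, not the authors' implementation: the function names, the toy vectors, and the temperature of 0.07 are all hypothetical, and deep-text prompt tuning, V-V attention, and glocal contrastive learning are omitted. The global image feature is compared against the global normal/abnormal text embeddings, and each patch feature is compared against the local normal/abnormal text embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def anomaly_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Softmax over (normal, abnormal) similarities; returns P(abnormal).

    The temperature value is an assumption chosen for illustration.
    """
    s_n = cosine(feature, normal_text) / temperature
    s_a = cosine(feature, abnormal_text) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)


def score_image(global_feature, patch_features, global_prompts, local_prompts):
    """Return (image-level score, per-patch anomaly map).

    global_prompts and local_prompts are hypothetical
    (normal_embedding, abnormal_embedding) pairs.
    """
    image_score = anomaly_probability(global_feature, *global_prompts)
    anomaly_map = [anomaly_probability(p, *local_prompts) for p in patch_features]
    return image_score, anomaly_map


if __name__ == "__main__":
    # Toy 2-D embeddings: axis 0 stands for "normal", axis 1 for "abnormal".
    global_prompts = ([1.0, 0.0], [0.0, 1.0])
    local_prompts = ([1.0, 0.0], [0.0, 1.0])
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    score, amap = score_image([1.0, 0.2], patches, global_prompts, local_prompts)
    print(score, amap)
```

Keeping the two prompt pairs as separate arguments mirrors the idea that image-level and pixel-level decisions can rely on different text embeddings; in a real pipeline the flat patch list would be reshaped and upsampled to the input resolution.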

Paper Structure

This paper contains 41 sections, 10 equations, 38 figures, 22 tables.

Figures (38)

  • Figure 1: (a) The refinement of prompt design, showing how normal and anomaly prompts are transformed into global and local semantic prompts. (b) Spider chart comparing pixel-level AUPRO scores across different CLIP-based methods on various datasets.
  • Figure 2: Overview of GlocalCLIP. The object-agnostic glocal semantic prompt enables the text encoder to extract complementary embeddings. Glocal contrastive learning aligns these embeddings to enhance anomaly detection performance. The model optimizes global and local margins to generate anomaly scores and similarity maps, effectively identifying abnormal regions.
  • Figure 3: Comparison of ZSAD results across industrial and medical domains. The first row displays input images from the industrial domain (Hazelnut, Bottle, Metal plate, Leather, Pcb1, Blotchy, and Electrical commutators) and the medical domain (HeadCT, BrainMRI, Endo). The second row presents the ground truth anomaly regions for each image. The remaining rows show the anomaly heatmaps generated by different models: CLIP, WinCLIP, CoOp, AnomalyCLIP, AdaCLIP, and GlocalCLIP.
  • Figure 3: Module ablation
  • Figure 4: Visualization of anomaly localization maps using global prompts with and without GCL. The first row shows sample images from the industrial domain, and the second row provides the true anomaly regions. The third row displays localization maps generated without GCL, where the global prompt struggles to precisely localize pixel-level anomalies. The last row shows localization maps generated with GCL, where the model demonstrates improved detection of both global and local anomalies, effectively localizing fine-grained anomalous regions.
  • ...and 33 more figures