Domain Adaptation of Attention Heads for Zero-shot Anomaly Detection

Kiyoon Jeong, Jaehyuk Heo, Junyeong Son, Pilsung Kang

TL;DR

HeadCLIP tackles zero-shot anomaly detection (ZSAD) under domain shift by adapting both the text and image encoders. It introduces Learnable Head Weights (LHW) to reweight Vision Transformer attention heads and a Joint Anomaly Score (JAS) that fuses pixel-level and image-level cues, enabling domain-specific anomaly detection without normal training data. Across 7 industrial and 10 medical datasets, HeadCLIP outperforms prior ZSAD methods by up to 4.9 percentage points in pixel-level mean anomaly detection score (mAD) and up to 3.1 percentage points in image-level mAD, with visibly sharper anomaly localization. The results indicate that adapting both modalities is valuable for real-world anomaly detection tasks where normal data is scarce or unavailable.

Abstract

Zero-shot anomaly detection (ZSAD) detects anomalies in images without access to normal samples, which is useful in realistic scenarios where model training is not possible. However, existing ZSAD methods either do not adapt their general-purpose backbone models to the anomaly detection domain or adapt only some model components. In this paper, we propose HeadCLIP, which overcomes these limitations by adapting both the text and image encoders to the domain. HeadCLIP generalizes the concepts of normality and abnormality through learnable prompts in the text encoder, and introduces learnable head weights in the image encoder to dynamically adjust the features held by each attention head according to domain characteristics. Additionally, we maximize the effect of domain adaptation by introducing a joint anomaly score that uses domain-adapted pixel-level information for image-level anomaly detection. Experiments on multiple real datasets in both industrial and medical domains show that HeadCLIP outperforms existing ZSAD techniques at both pixel and image levels. In the industrial domain, it improves pixel-level mean anomaly detection score (mAD) by up to 4.9%p and image-level mAD by up to 3.0%p, with similar improvements (3.2%p and 3.1%p, respectively) in the medical domain.

Paper Structure

This paper contains 38 sections, 12 equations, 4 figures, 26 tables.

Figures (4)

  • Figure 1: Comparison of domain adaptation approaches in CLIP-based anomaly detection methods: (a) AnoVL, (b) AnomalyCLIP, (c) AdaCLIP, and (d) HeadCLIP. The fire emoji indicates where domain adaptation is performed, with our method achieving more comprehensive adaptation of both modalities.
  • Figure 2: (a) Overview of the proposed anomaly detection framework. The model comprises a global path (top), which employs standard self-attention, and a local path (bottom), which applies consistent self-attention (CSA) enhanced with learnable head weights (LHW) to extract fine-grained features. A text encoder processes both normal and abnormal prompts to guide semantic understanding. The resulting global and local features are fused to produce an anomaly map and a global anomaly score, which are integrated into the Joint Anomaly Score (JAS). (b) Architecture of the multi-head CSA module. Each attention head is modulated by a learnable weight, and the outputs are concatenated to form a unified local representation. (c) Computation of the JAS by combining the global anomaly score with the top-$k\%$ average of the pixel-wise anomaly map.
  • Figure 3: Qualitative comparison of pixel-level anomaly detection results across different models. The leftmost column shows original input images from various domains, along with their corresponding ground truth masks. The comparison demonstrates that HeadCLIP achieves more precise anomaly localization and generates clearer anomaly maps compared to previous methods. These results validate the effectiveness of our method in achieving robust anomaly detection across diverse domains.
  • Figure 4: Image-level mean anomaly detection (mAD) performance as a function of the joint anomaly score ratio (x-axis) on industrial (left) and medical (right) datasets. The plots compare four methods with the y-axis showing image-level mAD. The results illustrate how varying the joint anomaly score ratio influences the overall detection performance for each approach.
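The Figure 2 caption describes two mechanisms that a short sketch may make more concrete: scaling each attention head's output by a learnable weight before concatenation, and combining the global anomaly score with the top-k% average of the pixel-wise anomaly map. The code below is only an illustration under stated assumptions, not the authors' implementation. The function names, the use of one raw scalar weight per head, the convex combination controlled by `ratio`, and the default values of `k_percent` and `ratio` are all hypothetical; the summary does not give the exact formulas.

```python
import math


def weight_and_concat_heads(head_outputs, head_weights):
    """Scale each head's feature vector by its scalar weight, then concatenate.

    Assumption: one raw scalar weight per head; the paper may normalize
    or parameterize the weights differently.
    """
    if len(head_outputs) != len(head_weights):
        raise ValueError("need exactly one weight per attention head")
    fused = []
    for features, weight in zip(head_outputs, head_weights):
        fused.extend(weight * value for value in features)
    return fused


def top_k_percent_mean(anomaly_map, k_percent):
    """Mean of the highest k% of pixel scores in a 2-D anomaly map."""
    pixels = sorted((p for row in anomaly_map for p in row), reverse=True)
    if not pixels:
        raise ValueError("anomaly map is empty")
    # Always keep at least one pixel, even for very small k.
    count = max(1, math.ceil(len(pixels) * k_percent / 100.0))
    top = pixels[:count]
    return sum(top) / len(top)


def joint_anomaly_score(global_score, anomaly_map, k_percent=1.0, ratio=0.5):
    """Blend the image-level score with the top-k% pixel average.

    Assumption: a convex combination in which `ratio` is the share given
    to the pixel-level term; the paper's exact fusion rule may differ.
    """
    local_score = top_k_percent_mean(anomaly_map, k_percent)
    return (1.0 - ratio) * global_score + ratio * local_score
```

In this sketch, setting `ratio` to 0 falls back to the global score alone, and setting it to 1 uses only the pixel-level evidence, which loosely mirrors the ratio sweep plotted in Figure 4.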