Domain Adaptation of Attention Heads for Zero-shot Anomaly Detection
Kiyoon Jeong, Jaehyuk Heo, Junyeong Son, Pilsung Kang
TL;DR
HeadCLIP tackles zero-shot anomaly detection (ZSAD) under domain shift by adapting both the text and image encoders to the target domain. It introduces Learnable Head Weights (LHW) to reweight Vision Transformer attention heads and a Joint Anomaly Score (JAS) that feeds domain-adapted pixel-level information into image-level detection, enabling domain-specific anomaly detection without normal training data. Empirical results across 7 industrial and 10 medical datasets show HeadCLIP outperforms prior ZSAD methods by up to 4.9 percentage points in pixel-level mAD and up to 3.1 percentage points in image-level mAD, with strong qualitative localization improvements. The work demonstrates the practical value of cross-modal domain adaptation for real-world anomaly detection tasks where normal data is scarce or unavailable.
Abstract
Zero-shot anomaly detection (ZSAD) detects anomalies in images without access to normal samples, which is beneficial in realistic scenarios where model training is not possible. However, existing ZSAD methods either do not adapt their general-purpose backbone models to the anomaly detection domain or adapt only some model components. In this paper, we propose HeadCLIP, which overcomes these limitations by adapting both the text and image encoders to the domain. HeadCLIP generalizes the concepts of normality and abnormality through learnable prompts in the text encoder, and introduces learnable head weights in the image encoder to dynamically adjust the features held by each attention head according to domain characteristics. Additionally, we maximize the effect of domain adaptation by introducing a joint anomaly score that uses domain-adapted pixel-level information for image-level anomaly detection. Experimental results on multiple real datasets in both the industrial and medical domains show that HeadCLIP outperforms existing ZSAD techniques at both the pixel and image levels. In the industrial domain, HeadCLIP improves the pixel-level mean anomaly detection score (mAD) by up to 4.9%p and the image-level mAD by up to 3.0%p, with similar improvements in the medical domain (3.2%p and 3.1%p, respectively).
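To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch of head reweighting and score fusion. It is not the authors' implementation: the abstract does not give formulas, so the softmax normalization of the head weights, the use of the peak of the pixel map, the blending factor `alpha`, and the function names `reweight_heads` and `joint_anomaly_score` are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_heads(head_outputs, head_logits):
    """Scale each attention head's output for one token, then concatenate.

    head_outputs: list of H feature vectors, one per attention head.
    head_logits:  list of H learnable scalars (hypothetical parameterization).
    """
    if len(head_outputs) != len(head_logits):
        raise ValueError("need exactly one logit per attention head")
    num_heads = len(head_logits)
    # Assumption: softmax weights rescaled by H, so that uniform logits
    # reproduce the ordinary unweighted concatenation of heads.
    weights = [num_heads * w for w in softmax(head_logits)]
    merged = []
    for weight, vector in zip(weights, head_outputs):
        merged.extend(weight * value for value in vector)
    return merged


def joint_anomaly_score(image_score, pixel_map, alpha=0.5):
    """Blend an image-level score with the peak of a pixel-level anomaly map.

    Assumption: a convex combination controlled by alpha; the paper may
    combine the two signals differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pixel_peak = max(max(row) for row in pixel_map)
    return alpha * image_score + (1.0 - alpha) * pixel_peak


if __name__ == "__main__":
    heads = [[1.0, 2.0], [3.0, 4.0]]
    print(reweight_heads(heads, [0.0, 0.0]))   # uniform logits
    print(reweight_heads(heads, [2.0, 0.0]))   # first head emphasized
    print(joint_anomaly_score(0.2, [[0.1, 0.9], [0.3, 0.4]]))
```

In this sketch, uniform logits leave the concatenated features unchanged, so any domain adaptation comes entirely from training the logits away from uniform. The fusion step shows why a small, strongly anomalous region can raise the image-level score even when the global image score is low.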
