Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization

Hanqiu Deng; Zhaoxiang Zhang; Jinan Bao; Xingyu Li

TL;DR

This work presents AnoCLIP, a zero-shot framework for unified anomaly detection and localization. It counters CLIP's bias toward global image-level representations by extracting local patch tokens through a training-free value-to-value attention path. These local-aware CLIP features are paired with a unified domain-aware state prompting strategy to achieve fine-grained vision-language alignment, and the result is further refined by a fast test-time adapter optimized with pseudo-labels and noise supervision. On MVTecAD and VisA, AnoCLIP and its enhanced variant AnoCLIP+ deliver state-of-the-art zero-shot performance in both anomaly localization and detection, with favorable efficiency compared to multi-scale baselines. The approach demonstrates that domain-aware prompts plus lightweight adaptation enable CLIP-based models to perform open-world anomaly localization without training data, offering practical benefits for industrial inspection and related tasks.
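As a rough illustration of what a value-to-value attention path could look like, the sketch below computes self-attention in which the value vectors also play the role of queries and keys, so each patch attends to patches with similar values. This is an assumption about the general form only: the function name `value_to_value_attention`, the scaling by the square root of the dimension, and the plain-list representation are illustrative choices, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def value_to_value_attention(values):
    """Hypothetical V-V attention: softmax(V V^T / sqrt(d)) V.

    values: list of N patch value vectors, each a list of d floats.
    Returns N output vectors of length d.
    """
    d = len(values[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for vi in values:
        # Similarity of this patch's value vector to every value vector.
        scores = [scale * sum(a * b for a, b in zip(vi, vj)) for vj in values]
        weights = softmax(scores)
        # Weighted sum of value vectors using those similarities.
        outputs.append(
            [sum(w * vj[k] for w, vj in zip(weights, values)) for k in range(d)]
        )
    return outputs


# Toy input: two orthogonal patch value vectors.
toy_out = value_to_value_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because the attention weights come from value-value similarity, each output stays closest to its own patch, which is the local behavior the summary attributes to this path.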

Abstract

Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision. Recent studies attempt to use CLIP for zero-shot anomaly detection by matching images with normal and abnormal state prompts. However, since CLIP focuses on building correspondence between paired text prompts and global image-level representations, the lack of fine-grained patch-level vision-to-text alignment limits its capability for precise visual anomaly localization. In this work, we propose AnoCLIP for zero-shot anomaly localization. In the visual encoder, we introduce a training-free value-wise attention mechanism to extract intrinsic local tokens of CLIP for patch-level local description. From the perspective of text supervision, we design a unified domain-aware contrastive state prompting template for fine-grained vision-language matching. On top of the proposed AnoCLIP, we further introduce a test-time adaptation (TTA) mechanism to refine visual anomaly localization results, where we optimize a lightweight adapter in the visual encoder using AnoCLIP's pseudo-labels and noise-corrupted tokens. With both AnoCLIP and TTA, we better exploit the potential of CLIP for zero-shot anomaly localization and demonstrate the effectiveness of AnoCLIP on various datasets.
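The matching of patch tokens against normal and abnormal state prompts can be sketched as a two-way softmax over cosine similarities. This is a minimal sketch under stated assumptions: the function name `patch_anomaly_scores`, the temperature of 0.01, and the toy embeddings are hypothetical, and real text and patch embeddings would come from CLIP's encoders rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def patch_anomaly_scores(patch_tokens, normal_text, abnormal_text, temperature=0.01):
    """Per-patch probability of the abnormal state.

    patch_tokens: list of patch embeddings (lists of floats).
    normal_text / abnormal_text: embeddings of the two state prompts.
    temperature: assumed softmax temperature (hypothetical default).
    """
    scores = []
    for p in patch_tokens:
        s_normal = cosine(p, normal_text) / temperature
        s_abnormal = cosine(p, abnormal_text) / temperature
        # Stable two-way softmax; keep the abnormal-state probability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores


# Toy example: the first patch resembles the normal prompt, the second the abnormal one.
toy_scores = patch_anomaly_scores(
    [[1.0, 0.1], [0.1, 1.0]], normal_text=[1.0, 0.0], abnormal_text=[0.0, 1.0]
)
```

Reshaping the per-patch scores back onto the patch grid would give a coarse anomaly map, which is where local rather than global tokens matter.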

Paper Structure (23 sections, 12 equations, 6 figures, 7 tables)

Introduction
Related Work
Vision-Language Models.
Anomaly Detection & Localization
Methodology
CLIP for Zero-shot Anomaly Recognition
Extract Local-aware Visual Tokens from CLIP
Domain-aware State Prompting
Test-time Adaptation for Anomaly Localization
Experiment
Experimental Setup
Datasets.
Metrics.
Implementation.
Performance
...and 8 more sections

Figures (6)

Figure 1: A review of current visual anomaly detection tasks. (A) Anomaly detection is conventionally posed as a one-class classification task [mvtec, visa], where a separate model is trained for each class in the dataset. (B) Multi-class anomaly detection improves the efficiency of model usage by training a single model on normal images from multiple categories [uniad]. (C) A new and challenging task, zero-shot anomaly detection, requires the model to localize anomalous regions without seeing any normal samples from any category. In this study, we focus on tackling this task with vision-language (VL) models [clip] that exhibit open-world intelligibility.
Figure 2: (a) Overview of zero-shot anomaly localization. (b) Details of our architecture. Solid arrows indicate AnoCLIP, and dashed arrows indicate the procedure of AnoCLIP with TTA. The snowflake icon denotes a frozen module and the flame icon denotes an optimized module.
Figure 3: Detailed structure of our adapter. Weights denote the learnable parameters of the adapter. We optimize the discriminative objective $L_d$ with the adapted patch tokens $\Phi(p)$ and $\Phi(\hat{p})$, and jointly perform pseudo-supervised optimization of $L_p$ with $\Phi(p)$.
Figure 4: Visualization of zero-shot anomaly localization. The top row shows images from MVTec [mvtec] and the bottom row shows images from VisA [visa]. The red mask is our prediction and the green contour is the ground truth.
Figure 5: Ablation on the number of TTA epochs. We report AUROC, F1Max, and PRO for AnoCLIP+ within 10 epochs on MVTecAD and VisA.
...and 1 more figures
