VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

Yanning Hou; Peiyuan Li; Zirui Liu; Yitong Wang; Yanran Ruan; Jianfeng Qiu; Ke Xu

TL;DR

This work revisits whether the text branch is necessary in zero-shot anomaly detection (ZSAD) and presents VisualAD, a purely visual framework built on Vision Transformers. VisualAD achieves state-of-the-art performance on 13 ZSAD benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2.

Abstract

Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: https://github.com/7HHHHH/VisualAD

Paper Structure (21 sections, 13 equations, 6 figures, 5 tables)

Introduction
Related work
Unsupervised Anomaly Detection.
Zero-shot Anomaly Detection.
Approach
Problem Setting
Spatial-Aware Cross-Attention
Self-Alignment Function and Anomaly Scoring
Training Objective
Experiments
Experimental Setup
Performance Comparison with SOTA Method
Ablation Studies
Effect of Different Components.
Ablation on different pretrained ViT.
...and 6 more sections

Figures (6)

Figure 1: Exploratory study motivating this work. (a) AnomalyCLIP derives normal/abnormal prototypes via trainable text prompts and a text encoder. (b) A purely visual variant removes the text branch and directly learns two visual prototypes, achieving comparable or slightly better results on VisA and MVTec with 99% fewer parameters. Radar and line plots show that our variant maintains similar accuracy but with much smoother evaluation curves, whereas AnomalyCLIP oscillates noticeably.
Figure 2: Overview of VisualAD. Two learnable global tokens (anomaly and normal) are inserted into a frozen ViT. For intermediate layers $\ell$, SCA uses a few anchor queries with positional encoding and token-guided gating to aggregate localized spatial evidence, yielding enhanced tokens $\tilde{\mathbf{t}}_a^{(\ell)}, \tilde{\mathbf{t}}_n^{(\ell)}$. In parallel, each layer's patch features are recalibrated by a SAF, denoted as $\mathcal{F}_\ell$. The cosine-similarity difference between enhanced tokens and recalibrated patches forms layer-wise anomaly maps, which are upsampled and summed to obtain the final anomaly map; the image-level score is the mean of the top-$k$ responses.
Figure 3: Qualitative comparison of anomaly segmentation results for different ZSAD methods. The first five columns show images from industrial datasets, while the last three columns correspond to medical datasets.
Figure 4: Radar plot comparing different pretrained ViT backbones under the ZSAD setting. We evaluate CLIP and DINO-based variants from small to large model scales. Larger backbones and higher input resolutions generally lead to consistent improvements in both pixel-level and sample-level metrics, with CLIP’s ViT-L/14@336px giving the strongest sample-level accuracy and DINO’s ViT-g/14 delivering the highest pixel-level precision.
Figure 5: PCA visualization of feature distributions under three configurations. From left to right: vanilla CLIP features, CLIP with normal/anomaly tokens, and CLIP with both tokens and the MLP-based transformation. The anomaly cluster (dark) becomes progressively more compact and moves farther from the normal cluster, while the variance concentrates along a single dominant axis, indicating increasingly discriminative representations for anomaly separation.
...and 1 more figure
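
The scoring step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the sign convention (anomaly-token similarity minus normal-token similarity) is an assumption, and upsampling is skipped by assuming all layer maps already share one resolution. The SCA and SAF modules are not modeled; the inputs stand in for the enhanced tokens and recalibrated patch features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def layer_anomaly_map(patches, t_anom, t_norm):
    """Per-patch score for one layer (assumed sign convention):
    cos(patch, anomaly token) - cos(patch, normal token)."""
    return [cosine(p, t_anom) - cosine(p, t_norm) for p in patches]


def fuse_maps(layer_maps):
    """Sum layer-wise maps element-wise.
    Assumes every map has the same number of patches (no upsampling here)."""
    return [sum(values) for values in zip(*layer_maps)]


def image_score(anomaly_map, k):
    """Image-level score: mean of the k largest patch responses."""
    k = max(1, min(k, len(anomaly_map)))
    top = sorted(anomaly_map, reverse=True)[:k]
    return sum(top) / len(top)


# Toy example: 2-D features, three patches, two layers with identical features.
t_anom = [1.0, 0.0]  # stand-in for the enhanced anomaly token
t_norm = [0.0, 1.0]  # stand-in for the enhanced normal token
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

single_layer = layer_anomaly_map(patches, t_anom, t_norm)
final_map = fuse_maps([single_layer, single_layer])
score = image_score(final_map, k=2)
```

In this toy setup, the patch aligned with the anomaly token receives the highest response, the patch aligned with the normal token the lowest, and the patch equidistant from both scores zero.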
