Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao

TL;DR

The paper tackles MUAD by proposing a Meta-AD framework and a plain ViT-based instantiation, ViTAD, that forgoes pyramidal encoders/decoders. It demonstrates that a simple, globally attentive ViT architecture can achieve state-of-the-art anomaly detection and localization across MVTec AD, VisA, and Uni-Medical datasets with efficient training. Key contributions include the Meta-AD formulation, a globally aware ViTAD design with minimal fusion, and comprehensive ablations showing robustness to design choices and pretraining. The results indicate strong practical impact: competitive performance with significantly lower computational costs, enabling scalable MUAD in real-world industrial and medical contexts. The work also provides rich insights into the role of pretraining, model scale, and frequency-based explanations for anomaly detection using ViT.

Abstract

This work studies a challenging and practical issue known as multi-class unsupervised anomaly detection (MUAD). This problem requires only normal images for training while simultaneously testing both normal and anomaly images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcrafted engineering. In contrast, a plain Vision Transformer (ViT) with a more straightforward architecture has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on the MVTec AD, VisA, and Uni-Medical datasets. For example, it achieves 85.4 mAD on the MVTec AD dataset, surpassing UniAD by +3.0, and requires only 1.1 hours and 2.3G of GPU memory to complete model training on a single V100. Full code is available at https://zhangzjn.github.io/projects/ViTAD/.

Paper Structure (23 sections, 7 equations, 11 figures, 16 tables)

This paper contains 23 sections, 7 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Left: (a-c) display general reconstruction-based AD frameworks. (d) shows a Meta-AD framework that consists of image Encoder $\phi^{E}$, Fuser $\mathcal{F}$, and Decoder $\phi^{D}$. The dashed line indicates that the feature may be used by the Fuser $\mathcal{F}$. Right: Comprehensive quantitative comparison with popular methods by eight metrics on the MVTec AD dataset (see the experimental setup and state-of-the-art comparison sections).
  • Figure 2: Diagram of Multi-class Unsupervised AD setting.
  • Figure 3: Reconstruction-based Meta-AD paradigm, which consists of a pretrained image encoder $\phi^{E}$ that extracts features at different depths from the patch-embedding input, a feature fuser $\mathcal{F}$ that aggregates the extracted features, and a decoder $\phi^{D}$ that has the same structure as the encoder and reconstructs the multi-depth features. During the training phase, $\hat{F}_{i}$ is constrained by $F_{i}$ with loss function $\mathcal{L}_{i}$ to update $\phi^{D}$, while both $\hat{F}_{i}$ and $F_{i}$ are used to calculate anomaly map $A_{i}$ for inference.
  • Figure 4: Left: Pilot study for the necessity of global dependence on "cable" category, e.g., logically-dependent "cable swap" and long-distance dependent "combined" defects. Right: Quantitative evaluations. Our ViTAD markedly mitigates these challenges.
  • Figure 5: First 36 visualized feature maps of different stages ($S_{i}$, $i=1, 2, 3$) for pyramidal (RD, $S^{P}$) and columnar (ViT, $S^{C}$) backbones. The first two rows show the results of normal and anomaly images in the first column, and the last row shows differential maps. Results demonstrate the superiority of ViT for capturing more abundant features and locating more distinct anomalous regions.
  • ...and 6 more figures
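
The Figure 3 caption describes how the encoder features $F_{i}$ and the decoder reconstructions $\hat{F}_{i}$ are compared, once as a training loss $\mathcal{L}_{i}$ and once as an anomaly map $A_{i}$. The following is a minimal NumPy sketch of just that comparison step, not the authors' implementation. It assumes the features have already been extracted as arrays of shape (tokens, channels). The MSE loss follows the abstract, but scoring each token by cosine distance and averaging the maps over depths are illustrative assumptions, and the names `mse_loss` and `anomaly_map` are hypothetical.

```python
import numpy as np


def mse_loss(feats, recons):
    """Sum over depths i of the mean squared error between F_i and F_hat_i."""
    return sum(float(np.mean((f - r) ** 2)) for f, r in zip(feats, recons))


def anomaly_map(feats, recons, grid_hw, eps=1e-8):
    """Per-token cosine distance between F_i and F_hat_i, averaged over depths.

    Cosine distance and depth averaging are assumptions made for illustration.
    """
    h, w = grid_hw
    maps = []
    for f, r in zip(feats, recons):
        dot = np.sum(f * r, axis=-1)
        norms = np.linalg.norm(f, axis=-1) * np.linalg.norm(r, axis=-1) + eps
        maps.append((1.0 - dot / norms).reshape(h, w))
    return np.mean(maps, axis=0)


# Toy data: three depths, a 4x4 token grid, 8 channels per token.
rng = np.random.default_rng(0)
h, w, c = 4, 4, 8
feats = [rng.normal(size=(h * w, c)) for _ in range(3)]

# Perfect reconstruction everywhere except token 5 (row 1, column 1),
# which is flipped to simulate a region the decoder cannot reconstruct.
recons = [f.copy() for f in feats]
for r in recons:
    r[5] = -r[5]

loss = mse_loss(feats, recons)
amap = anomaly_map(feats, recons, (h, w))
peak = np.unravel_index(np.argmax(amap), amap.shape)
```

In this toy setup the only non-zero region of `amap` is the grid cell holding the poorly reconstructed token, which is the behaviour a reconstruction-based detector relies on: normal regions are reconstructed well and score near zero, while anomalous regions are not.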