ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining

Xincheng Yao, Yan Luo, Zefeng Qian, Chongyang Zhang

TL;DR

ADPretrain addresses the mismatch between ImageNet pretraining and industrial anomaly detection by learning anomaly-specific representations on RealIAD. It builds on residual features and introduces two contrastive objectives, an angle-oriented contrastive loss and a norm-oriented contrastive loss, which are used to train a Transformer-based Feature Projector that produces transferable pretrained representations. Replacing the original features in five embedding-based AD methods across five datasets yields consistent performance gains, and using feature norms as anomaly scores improves few-shot anomaly detection. The work highlights the value of domain-specific pretraining for anomaly detection and points to future directions in backbone design and broader applicability across AD models.
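The idea of using feature norms as anomaly scores can be illustrated with a minimal sketch. It assumes that anomalous patches end up with larger feature norms than normal ones, and that the image-level score is the maximum patch score; both choices, along with the function name `norm_anomaly_scores`, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def norm_anomaly_scores(features):
    """Score patches by the L2 norm of their feature vectors.

    features: array of shape (H, W, C), one C-dim feature per patch.
    Returns (score_map, image_score), where score_map has shape (H, W)
    and image_score is the largest patch score (an assumed pooling rule).
    """
    score_map = np.linalg.norm(features, axis=-1)
    image_score = float(score_map.max())
    return score_map, image_score

# Toy input: all patches have zero features except one "anomalous" patch.
feats = np.zeros((4, 4, 8))
feats[1, 2] = 3.0
score_map, image_score = norm_anomaly_scores(feats)
```

In this toy case the only nonzero entry of `score_map` is at position (1, 2), so that patch alone determines the image-level score.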

Abstract

The current mainstream and state-of-the-art anomaly detection (AD) methods are largely built on pretrained feature networks obtained through ImageNet pretraining. However, whether supervised or self-supervised, pretraining on ImageNet does not match the goal of anomaly detection (i.e., pretraining on natural images does not aim to distinguish between normal and abnormal). Moreover, there is typically a distribution shift between natural images and the industrial image data found in AD scenarios. These two issues can make ImageNet-pretrained features suboptimal for AD tasks. To further promote the development of the AD field, pretrained representations designed specifically for AD tasks are much needed and highly valuable. To this end, we propose a novel AD representation learning framework specially designed for learning robust and discriminative pretrained representations for industrial anomaly detection. Specifically, keeping closely to the goal of anomaly detection (i.e., focusing on discrepancies between normal and anomalous samples), we propose angle- and norm-oriented contrastive losses to simultaneously maximize the angle and the norm difference between normal and abnormal features. To avoid the distribution shift from natural images to AD images, our pretraining is performed on a large-scale AD dataset, RealIAD. To further alleviate the potential shift between pretraining data and downstream AD datasets, we learn the pretrained AD representations on top of a class-generalizable representation, residual features. For evaluation, we take five embedding-based AD methods and simply replace their original features with our pretrained representations. Extensive experiments on five AD datasets and five backbones consistently show the superiority of our pretrained features. The code is available at https://github.com/xcyao00/ADPretrain.
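To make the two ingredients of the abstract concrete, the following sketch computes residual features and two toy objectives. It is not the paper's formulation: matching each feature to its nearest normal reference, using mean cosine similarity as the angle term, and using a hinge with a `margin` as the norm term are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def residual_features(feats, normal_refs):
    """Subtract from each feature its nearest normal reference feature.

    feats: (N, C) features to convert; normal_refs: (M, C) reference bank.
    Nearest-neighbour matching by Euclidean distance is an assumption.
    """
    dists = np.linalg.norm(feats[:, None, :] - normal_refs[None, :, :], axis=-1)
    nearest = normal_refs[dists.argmin(axis=1)]
    return feats - nearest

def angle_term(normal, abnormal, eps=1e-8):
    """Mean cosine similarity over all normal/abnormal pairs.

    Minimizing this value pushes the two groups toward larger angles.
    """
    n = normal / (np.linalg.norm(normal, axis=1, keepdims=True) + eps)
    a = abnormal / (np.linalg.norm(abnormal, axis=1, keepdims=True) + eps)
    return float((n @ a.T).mean())

def norm_term(normal, abnormal, margin=1.0):
    """Shrink normal norms and push abnormal norms above a margin (hinge)."""
    n_norm = np.linalg.norm(normal, axis=1)
    a_norm = np.linalg.norm(abnormal, axis=1)
    return float(n_norm.mean() + np.maximum(0.0, margin - a_norm).mean())

# Toy data: two normal reference features and two query features.
refs = np.array([[1.0, 0.0], [0.0, 1.0]])
feats = np.array([[1.1, 0.0], [0.0, 3.0]])
res = residual_features(feats, refs)   # first row near-normal, second far
loss = angle_term(res[:1], res[1:]) + norm_term(res[:1], res[1:])
```

Here the first query sits close to a reference and yields a small residual, while the second yields a large one, so the norm difference between the two residuals is already visible before any training.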

Paper Structure

This paper contains 26 sections, 9 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: (a) Conceptual illustration of anomaly representation pretraining. (b) Performance comparison on MVTecAD (left) and VisA (right). "w/o" and "w/" refer to without and with our pretrained features. Across multiple AD methods and backbones, our pretrained features are consistently superior to the original features (the dashed lines generally lie above the solid lines).
  • Figure 2: Framework overview. We learn pretrained AD representations based on residual features rather than on the features produced directly by the backbone network. Residual features are generated by subtracting normal reference features (Sec. \ref{sec:pre_trained_representations}), which are extracted from the normal reference samples. Features produced by the Feature Projector are optimized by the angle- and norm-oriented contrastive losses (Sec. \ref{sec:contrastive_loss}). The Feature Projector is based on the Transformer architecture, but we replace self-attention with our proposed learnable key/value attention (Sec. \ref{sec:feature_projector}).
  • Figure 3: Feature t-SNE visualization. "w/o" and "w/" refer to without and with our pretrained features. These features are from the "capsules" class of the VisA dataset. We show more visualization results in Fig. \ref{fig:tsne_visualization_sup} in Appendix \ref{sec:qualitative_results}.
  • Figure 4: Qualitative results. The anomaly score maps are generated by PatchCore with CLIP-L as the backbone network. "w/o pretrained" and "w/ pretrained" refer to without and with our pretrained features.
  • Figure 5: Feature t-SNE visualization. For (a), (b), (c), and (d), the features are from the "candle", "chewinggum", "pcb1", and "fryum" classes from the VisA dataset.
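The Figure 2 caption mentions a learnable key/value attention in place of self-attention. One plausible reading, sketched below, is that queries are computed from the input tokens while keys and values are free parameters that do not depend on the input. The class name `LearnableKVAttention`, the single-head design, the `num_slots` parameter, and the random initialization standing in for trained weights are all assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LearnableKVAttention:
    """Single-head attention whose keys and values are parameters.

    Random values stand in for weights that would be learned in training.
    """

    def __init__(self, dim, num_slots, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.w_q = rng.normal(scale=dim ** -0.5, size=(dim, dim))  # query projection
        self.keys = rng.normal(size=(num_slots, dim))    # input-independent keys
        self.values = rng.normal(size=(num_slots, dim))  # input-independent values

    def __call__(self, tokens):
        """tokens: (N, dim). Returns (output of shape (N, dim), attention weights)."""
        q = tokens @ self.w_q
        attn = softmax(q @ self.keys.T / np.sqrt(self.dim), axis=-1)
        return attn @ self.values, attn

layer = LearnableKVAttention(dim=8, num_slots=4)
tokens = np.random.default_rng(1).normal(size=(5, 8))
out, attn = layer(tokens)
```

In this sketch each token attends only to the parameter slots, so the output for a token does not depend on the other tokens in the sequence, unlike standard self-attention.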