HaloAE: An HaloNet based Local Transformer Auto-Encoder for Anomaly Detection and Localization

E. Mathian; H. Liu; L. Fernandez-Cuesta; D. Samaras; M. Foll; L. Chen

TL;DR

HaloAE addresses unsupervised anomaly detection and localization by integrating a HaloNet-based local self-attention auto-encoder with multi-scale CNN features and self-supervised learning. The method combines a Cut&Paste SSL proxy task, VGG19-derived multi-scale feature maps, and a HaloNet encoder–decoder to produce pixel-wise anomaly maps from feature-map and image reconstructions. Adaptive loss weighting and an SSL framework significantly improve both image-level detection and pixel-level localization, achieving competitive averages of ROC-AUC 91.4% (image) and 91.2% (pixel) on the MVTec AD dataset. This work demonstrates that local self-attention in vision transformers can effectively complement convolutional processing for anomaly detection, suggesting broader applicability of local-transformer architectures in real-world industrial inspection scenarios.

Abstract

Unsupervised anomaly detection and localization is a crucial task, as it is impossible to collect and label all possible anomalies. Many studies have emphasized the importance of integrating local and global information to achieve accurate segmentation of anomalies. To this end, there has been growing interest in Transformers, which can model long-range content interactions. However, global interactions through self-attention are generally too expensive for most image scales. In this study, we introduce HaloAE, the first auto-encoder based on a local 2D version of the Transformer, built with HaloNet. HaloAE is a hybrid model that combines convolution and local 2D block-wise self-attention layers, and it jointly performs anomaly detection and segmentation with a single model. We achieved competitive results on the MVTec dataset. This suggests that vision models incorporating Transformers could benefit from computing the self-attention operation locally, and it paves the way for other applications.

Paper Structure (22 sections, 7 equations, 6 figures, 5 tables)

This paper contains 22 sections, 7 equations, 6 figures, 5 tables.

Introduction
Related Work
Anomaly Detection and Localization Models
Reconstruction Based Methods:
Distribution-based methods:
Self-supervised learning based methods:
Visual Transformer
Global approaches:
Local approaches:
Transformer for anomaly detection
Method
Architecture
Self-supervised learning framework:
Image features extraction:
Reconstruction strategy grounded on Halonet:
...and 7 more sections

Figures (6)

Figure 1: Anomaly localization results from the MVTec AD dataset. The first and third rows show the input images; the second and fourth rows show the anomaly maps generated by HaloAE. The ground-truth localization is circled with a pink line.
Figure 2: Overview of HaloAE for AD. A) Cut&Paste data augmentation strategy for the SSL (Li et al., 2021). B) Multi-scale feature map extraction via a VGG19 network (Simonyan and Zisserman, 2014) pretrained on ImageNet (Deng et al., 2009). C) HaloNet AE for feature map reconstruction. D) Reconstruction of images via transposed VGG blocks. E) Linear layer to determine the classification loss. $Im$ and $\hat{Im}$ refer to the original image and the reconstructed image respectively; similarly, $fm$ and $\hat{fm}$ refer to the feature map and its reconstruction. $l$ and $\hat{l}$ refer to the label and its prediction: $0$ is associated with the original picture and $1$ with its augmented versions. $\mathcal{L}_{cls}$, $\mathcal{L}_{Rec_{fm}}$ and $\mathcal{L}_{Rec_{im}}$ refer to the classification loss and to the reconstruction quality of feature maps and images respectively.
Figure 3: Evolution of $\bm{\alpha}$ along with the number of epochs. Each curve is modeled according to the following equation, whose parameters are indicated in the legend: $\frac{(a-b)}{1+\exp(x-\frac{c}{2})^{0.05}} + b$. The $\bm{\alpha}$ values are then normalized so that they sum to 1.
Figure S1: Post-processing workflow. A) Input image. B) Anomaly map (see eq. 6 in main text). C) Normalized anomaly map (see eq. 7 in main text). D) Normalized anomaly map smoothed with a Gaussian filter.
Figure S2: Classification results on carpet. A) Distribution of the means of post-processed anomaly maps computed on the feature map, for defect-free and abnormal objects. B) Distribution of the means of post-processed anomaly maps computed on the feature map, by defect category. Defect-free objects and anomalous objects have similar distributions. C) Carpet anomaly maps: objects without defects on the left, abnormal objects on the right.
...and 1 more figures
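
The loss-weight schedule described in the Figure 3 caption can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation. It assumes the exponent $0.05$ applies to the whole exponential, so that $\exp(x-\frac{c}{2})^{0.05} = \exp(0.05\,(x-\frac{c}{2}))$. The function names, the loss names used as dictionary keys, and the $(a, b, c)$ triples are hypothetical; the actual parameter values are given only in the figure legend.

```python
import math


def alpha_curve(epoch, a, b, c, rate=0.05):
    """Raw weight at a given epoch: (a - b) / (1 + exp(rate * (epoch - c / 2))) + b.

    Assumes the 0.05 exponent in the caption multiplies the argument of exp.
    The curve starts near a, ends near b, and equals (a + b) / 2 at epoch c / 2.
    """
    return (a - b) / (1.0 + math.exp(rate * (epoch - c / 2.0))) + b


def normalized_alphas(epoch, params):
    """Evaluate every curve at this epoch, then rescale so the weights sum to 1."""
    raw = {name: alpha_curve(epoch, *abc) for name, abc in params.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical (a, b, c) triples, chosen only to illustrate the shape of the curves.
params = {
    "cls": (1.0, 0.1, 200.0),     # weight decays from about 1.0 towards 0.1
    "rec_fm": (0.1, 1.0, 200.0),  # weight grows from about 0.1 towards 1.0
    "rec_im": (0.1, 0.5, 200.0),  # weight grows from about 0.1 towards 0.5
}

if __name__ == "__main__":
    for epoch in (0, 100, 200):
        print(epoch, normalized_alphas(epoch, params))
```

With these illustrative parameters, the classification weight dominates early in training and the two reconstruction weights take over after the midpoint `c / 2`; the normalization keeps the total loss weight constant across epochs.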
