A Hierarchically Feature Reconstructed Autoencoder for Unsupervised Anomaly Detection

Honghui Chen, Pingping Chen, Huan Mao, Mengxi Jiang

TL;DR

This work tackles unsupervised anomaly detection and localization without labeled anomalies or data augmentation by introducing a simple encoder–decoder architecture. A fixed, ImageNet-pretrained encoder extracts hierarchical features, and a decoder reconstructs these features across $K=3$ levels to generate multi-scale residual maps used for anomaly detection and localization. By training only the decoder to minimize multi-level feature reconstruction losses and fusing residuals into an anomaly map, the method achieves strong performance across MNIST, Fashion-MNIST, CIFAR-10, and MVTecAD, often surpassing state-of-the-art approaches. The approach is notable for its simplicity, efficiency (single forward pass during inference), and effectiveness in leveraging feature-space reconstruction rather than pixel-level recovery.

Abstract

Anomaly detection and localization without any manual annotations or prior knowledge is a challenging task in the unsupervised learning setting. Existing works achieve excellent anomaly detection performance, but rely on complex networks or cumbersome pipelines. To address this issue, this paper explores a simple but effective architecture for anomaly detection. It consists of a well pre-trained encoder that extracts hierarchical feature representations and a decoder that reconstructs these intermediate features from the encoder. In particular, it requires neither data augmentation nor anomalous images for training. Anomalies are detected when the decoder fails to reconstruct features well, and the hierarchical feature reconstruction errors are then aggregated into an anomaly map to achieve anomaly localization. Comparing the encoder and decoder features at multiple levels leads to more accurate and robust localization than comparing a single feature level or comparing pixel by pixel, as in conventional works. Experimental results show that the proposed method outperforms state-of-the-art methods on the MNIST, Fashion-MNIST, CIFAR-10, and MVTec Anomaly Detection datasets on both anomaly detection and localization.
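
The training objective described above, in which only the decoder learns to reconstruct the encoder's features at every level, can be illustrated with a minimal sketch. This summary does not state which distance the paper uses, so the per-level mean squared error below is an assumption. The function name `hierarchical_reconstruction_loss` is hypothetical, and random arrays stand in for real encoder and decoder features.

```python
import numpy as np

def hierarchical_reconstruction_loss(encoder_feats, decoder_feats):
    """Sum of per-level mean squared errors between encoder and decoder features.

    Both arguments are lists of K arrays, each shaped (C_k, H_k, W_k).
    MSE is an assumed distance; the summary does not name the one used.
    """
    if len(encoder_feats) != len(decoder_feats):
        raise ValueError("expected the same number of feature levels")
    total = 0.0
    for f_enc, f_dec in zip(encoder_feats, decoder_feats):
        if f_enc.shape != f_dec.shape:
            raise ValueError("feature shapes must match at every level")
        total += float(np.mean((f_enc - f_dec) ** 2))
    return total

# Toy data: K = 3 levels of random "features" (shapes are illustrative only).
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]
enc = [rng.standard_normal(s) for s in shapes]

perfect = [f.copy() for f in enc]   # decoder reproduces every level exactly
shifted = [f + 0.5 for f in enc]    # decoder is off by 0.5 everywhere

loss_perfect = hierarchical_reconstruction_loss(enc, perfect)
loss_shifted = hierarchical_reconstruction_loss(enc, shifted)
```

In this toy setup a perfect reconstruction gives a loss of zero, while a constant offset of 0.5 contributes about 0.25 per level. In an actual training loop the encoder would stay frozen and only the decoder's parameters would receive gradients from this loss.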

Paper Structure (16 sections, 11 equations, 4 figures, 4 tables)

This paper contains 16 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The visualization results of our method on the MVTecAD dataset. Each column represents a category in the dataset, and rows from top to bottom correspond to defective images, ground truth regions, and anomaly heat maps inferred by our method.
  • Figure 2: An overview of our proposed framework. The encoder extracts feature representations of the input image at different layers, and the decoder decodes the encoder's high-dimensional output feature to reconstruct the features of those layers. The residual feature maps $\{ \phi_1, \phi_2, \phi_3\}$ between the hierarchical features of the encoder and the decoder serve as the hierarchical feature reconstruction loss in the training phase, guiding the decoder to reconstruct the features of normal images, and as an anomaly map in the inference phase, used to detect and locate anomalies.
  • Figure 3: Qualitative results for the residual feature maps of our method on the MVTecAD dataset. Input images, ground truth regions, residual feature maps $\{ \phi_1, \phi_2, \phi_3\}$, and anomaly maps are displayed along the column direction.
  • Figure 4: Localization results for test examples with different types of anomalies: crack, glue strip, gray stroke, oil, and rough in (a) tile; and bent lead, cut lead, damaged case, and misplaced case in (b) transistor.
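
The inference step summarized in the Figure 2 and Figure 3 captions, where residual maps $\{ \phi_1, \phi_2, \phi_3\}$ are computed per level and combined into one anomaly map, can be sketched as follows. Several details are assumptions, since the captions do not specify them: the per-location L2 distance across channels, nearest-neighbour upsampling, fusion by summation, and the maximum of the map as the image-level score. All function names are hypothetical.

```python
import numpy as np

def residual_map(f_enc, f_dec):
    """Per-location L2 distance across channels: (C, H, W) -> (H, W)."""
    return np.sqrt(np.sum((f_enc - f_dec) ** 2, axis=0))

def upsample_nearest(m, size):
    """Nearest-neighbour upsampling of a square map to (size, size).

    Assumes `size` is an integer multiple of the map's side length.
    """
    factor = size // m.shape[0]
    return np.kron(m, np.ones((factor, factor)))

def anomaly_map(encoder_feats, decoder_feats, size):
    """Fuse per-level residual maps by summation (an assumed fusion rule)."""
    maps = [
        upsample_nearest(residual_map(e, d), size)
        for e, d in zip(encoder_feats, decoder_feats)
    ]
    return np.sum(maps, axis=0)

# Toy data: three levels with spatial sizes 8, 4, and 2 (illustrative only).
rng = np.random.default_rng(0)
enc = [
    rng.standard_normal((4, 8, 8)),
    rng.standard_normal((8, 4, 4)),
    rng.standard_normal((16, 2, 2)),
]
dec = [f.copy() for f in enc]
# Simulate a reconstruction failure in the same image region at every level.
dec[0][:, 6, 6] += 1.0
dec[1][:, 3, 3] += 1.0
dec[2][:, 1, 1] += 1.0

amap = anomaly_map(enc, dec, size=8)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(amap), amap.shape))
score = float(amap.max())  # image-level score (an assumed choice)
```

In this toy setup the coarse levels flag a broad region and the finest level narrows it down, so the fused map peaks at the single location where all three residuals overlap. That is one intuition for why multi-level comparison can localize more precisely than a single feature level.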