Self-Supervised Training with Autoencoders for Visual Anomaly Detection

Alexander Bauer, Shinichi Nakajima, Klaus-Robert Müller

TL;DR

The paper addresses visual anomaly detection in the setting where normal data lie on a low‑dimensional manifold. It proposes a self‑supervised training regime in which an autoencoder receives partially distorted inputs and learns to produce locally consistent reconstructions while suppressing anomalous patterns. The theoretical analysis shows that the resulting model resembles a nonlinear orthogonal projection onto the manifold of normal samples and connects this projection to RCAE/DAE/CAE, with data augmentation choices designed to preserve it. Empirically, the approach reports a new state‑of‑the‑art result on the MVTec AD benchmark, evaluated for both detection and localization, which is relevant to manufacturing and related domains. The work thus places self‑supervised reconstruction and manifold‑based anomaly detection in a common orthogonal‑projection framework, supported by both formal analysis and empirical results.

Abstract

We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold. Here, regularized autoencoders provide a popular approach by learning the identity mapping on the set of normal examples, while trying to prevent good reconstruction on points outside of the manifold. Typically, this goal is implemented by controlling the capacity of the model, either directly by reducing the size of the bottleneck layer or implicitly by imposing some sparsity (or contraction) constraints on parts of the corresponding network. However, neither of these techniques explicitly penalizes the reconstruction of anomalous signals, which often results in poor detection. We tackle this problem by adapting a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples. Informally, our training objective regularizes the model to produce locally consistent reconstructions, while replacing irregularities by acting as a filter that removes anomalous patterns. To support this intuition, we perform a rigorous formal analysis of the proposed method and provide a number of interesting insights. In particular, we show that the resulting model resembles a non-linear orthogonal projection of partially corrupted images onto the submanifold of uncorrupted samples. On the other hand, we identify the orthogonal projection as an optimal solution for a number of regularized autoencoders including the contractive and denoising variants. We support our theoretical analysis by empirical evaluation of the resulting detection and localization performance of the proposed method. In particular, we achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.

Paper Structure

This paper contains 23 sections, 14 theorems, 46 equations, 15 figures, and 3 tables.

Key Result

Proposition 1

Figures (15)

  • Figure 1: A few anomaly detection results of our approach. Each row shows the input image, an overlay with the anomaly heatmap, and the resulting prediction mask, respectively.
  • Figure 2: Illustration of the reconstruction effect of our model trained either on the wood, carpet or grid images (without defects) from the MVTec AD dataset.
  • Figure 3: Illustration of our AD process. Given input $\hat{\boldsymbol{x}}$, we first compute an output $f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})$ by replicating normal regions and replacing irregularities with locally consistent patterns. Then we compute a pixel-wise squared difference $(\hat{\boldsymbol{x}} - f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}}))^2$, which is subsequently averaged over the color channels to produce the difference map $\text{Diff}[\hat{\boldsymbol{x}}, f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})] \in \mathbb{R}^{h \times w}$. In the last step we apply a series of averaging convolutions $G_k$ to the difference map to produce our final anomaly heatmap $\text{anomap}_{f_{\boldsymbol{\theta}}}^{n,k}(\hat{\boldsymbol{x}})$.
  • Figure 4: Illustration of data generation for training. After randomly choosing the locations of the patches to be modified, we create new content by gluing the extracted patches together with the corresponding replacements. Given a real-valued mask $\boldsymbol{M} \in [0,1]^{\tilde{h} \times \tilde{w} \times 3}$ marking corrupted regions within a patch, an original image patch $\boldsymbol{x}$, and a corresponding replacement $\boldsymbol{y}$, we create the next corrupted patch by merging the two patches according to the formula $\hat{\boldsymbol{x}} := \boldsymbol{M} \odot \boldsymbol{y} + \bar{\boldsymbol{M}} \odot \boldsymbol{x}$. All mask shapes $\boldsymbol{M}$ are created by applying Gaussian distortion to the same (static) mask representing a filled disk at the center of the patch with a smoothly fading boundary towards the exterior of the disk.
  • Figure 5: Illustration of our network architecture Model II including the Convex Combination Module (CCM) marked with brown color and the skip-connections represented by the horizontal arrows. Without these additional elements we get the baseline Model I.
  • ...and 10 more figures
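
The patch blending from the Figure 4 caption and the heatmap computation from the Figure 3 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation and rests on several assumptions that the captions do not confirm: $\bar{\boldsymbol{M}}$ is taken to be the complement $1 - \boldsymbol{M}$, the averaging convolution $G_k$ is taken to be a $k \times k$ box filter with zero padding, and $n$ is taken to be the number of times the filter is applied. The Gaussian distortion of the mask is omitted, and the clean patch stands in for the model output $f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})$. The helper names `disk_mask`, `corrupt_patch`, `box_filter`, and `anomaly_heatmap` are hypothetical.

```python
import numpy as np


def disk_mask(size, radius, fade):
    """Static soft disk in [0, 1]: 1 inside the disk, fading linearly to 0 outside."""
    ys, xs = np.mgrid[0:size, 0:size]
    c = (size - 1) / 2.0
    dist = np.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    m = np.clip((radius + fade - dist) / fade, 0.0, 1.0)
    return np.repeat(m[:, :, None], 3, axis=2)  # shape (size, size, 3)


def corrupt_patch(x, y, mask):
    """x_hat = M * y + M_bar * x, assuming M_bar = 1 - M (an assumption)."""
    return mask * y + (1.0 - mask) * x


def box_filter(img, k):
    """k x k averaging convolution with zero padding (k odd); assumed form of G_k."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def anomaly_heatmap(x_hat, recon, n=2, k=5):
    """Squared difference, averaged over color channels, then smoothed n times."""
    diff = ((x_hat - recon) ** 2).mean(axis=2)  # difference map, shape (h, w)
    for _ in range(n):
        diff = box_filter(diff, k)
    return diff


rng = np.random.default_rng(0)
x = rng.uniform(0.4, 0.6, size=(32, 32, 3))   # "normal" patch (synthetic)
y = np.ones((32, 32, 3))                      # replacement content (synthetic)
mask = disk_mask(32, radius=6, fade=3)
x_hat = corrupt_patch(x, y, mask)

# Stand-in for f_theta(x_hat): an ideal model would return the clean patch x.
heatmap = anomaly_heatmap(x_hat, x, n=2, k=5)
peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print("heatmap shape:", heatmap.shape, "peak at:", peak)
```

With the clean patch used as the reconstruction, the heatmap is zero far from the disk and largest near the patch center, which is the behavior the difference map is meant to expose. In practice `recon` would come from the trained network rather than from the ground truth.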

Theorems & Definitions (17)

  • Definition 1
  • Proposition 1
  • Definition 2
  • Definition 3
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • ...and 7 more