Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, Hongen Liao

TL;DR

Dinomaly tackles multi-class unsupervised anomaly detection (MUAD) with a minimalistic, Transformer-only framework built from four components: Foundation Transformers, a Noisy Bottleneck, Linear Attention, and Loose Reconstruction. A pretrained ViT encoder extracts features, an 8-layer decoder reconstructs the encoder's mid-level features, and anomalies are detected from the encoder–decoder discrepancy; dropout-based noise injection and relaxed reconstruction constraints keep the decoder from learning an identity mapping. Across MVTec-AD, VisA, and Real-IAD, Dinomaly achieves state-of-the-art MUAD performance, scales with model size and input resolution, and remains competitive with class-separated models. The results suggest that a simple, universal Transformer-based approach can close much of the gap between MUAD and specialized per-class models while remaining practical to scale and apply to real-world anomaly detection tasks.
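To make the "encoder–decoder discrepancy" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the discrepancy is measured as per-patch cosine distance between encoder features and their decoder reconstruction, and that the image-level score is the maximum of the resulting map; the function name `cosine_anomaly_map`, the feature shapes, and the max-pooling choice are all illustrative assumptions.

```python
import numpy as np

def cosine_anomaly_map(enc_feats, dec_feats, eps=1e-8):
    """Per-patch anomaly score: 1 - cosine similarity.

    enc_feats, dec_feats: arrays of shape (H, W, C) holding encoder features
    and their decoder reconstruction (shapes are an assumption of this sketch).
    """
    dot = np.sum(enc_feats * dec_feats, axis=-1)
    norms = np.linalg.norm(enc_feats, axis=-1) * np.linalg.norm(dec_feats, axis=-1)
    return 1.0 - dot / (norms + eps)

# Toy data: a 4x4 grid of patches with 16-dimensional features.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 4, 16))
dec = enc.copy()          # the decoder reconstructs every patch perfectly...
dec[2, 1] = -enc[2, 1]    # ...except one patch, which it gets badly wrong

amap = cosine_anomaly_map(enc, dec)   # pixel-level (patch-level) scores
image_score = amap.max()              # assumed image-level aggregation
anomalous_patch = np.unravel_index(amap.argmax(), amap.shape)
```

Well-reconstructed patches score near 0, while the patch the decoder fails to reproduce scores near 2, the maximum possible cosine distance. In practice the map would be upsampled to the input resolution before localization.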

Abstract

Recent studies have highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROCs of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which not only surpasses state-of-the-art multi-class UAD methods but also reaches the level of the most advanced class-separated UAD records.
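The claim that Linear Attention "naturally cannot focus" can be illustrated with a small NumPy sketch. This is a toy comparison under stated assumptions, not the paper's code: it uses the common `elu(x) + 1` feature map from the linear-attention literature (whether Dinomaly uses exactly this map is an assumption here), single-head attention, and random tokens scaled so that softmax attention becomes sharply peaked. All function names are hypothetical.

```python
import numpy as np

def softmax_attention_weights(q, k):
    """Row-normalized softmax attention weights, shape (n, n)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Linear attention computed without forming the (n, n) weight matrix."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    kv = phi_k.T @ v                     # (d, d_v) summary of keys and values
    z = phi_q @ phi_k.sum(axis=0)        # (n,) per-query normalizer
    return (phi_q @ kv) / z[:, None]

def linear_attention_weights(q, k):
    """Explicit weights implied by linear attention, for inspection only."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    w = phi_q @ phi_k.T
    return w / w.sum(axis=-1, keepdims=True)

# Toy tokens: each query equals its own key, so a "focused" attention
# would put most of its weight on the matching token.
rng = np.random.default_rng(0)
n, d = 32, 16
q = k = 4.0 * rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

w_soft = softmax_attention_weights(q, k)
w_lin = linear_attention_weights(q, k)
soft_peak = w_soft.max(axis=-1).mean()   # average largest weight per query
lin_peak = w_lin.max(axis=-1).mean()
out_lin = linear_attention(q, k, v)
```

In this toy setup the softmax weights concentrate on a single token per query, whereas the linear-attention weights stay spread across many tokens. A spread-out attention makes it harder for a decoder to copy each token straight through, which is one plausible reading of why an unfocused attention helps against identity mapping.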

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, and 24 tables.

Figures (8)

  • Figure 1: Setting, benchmarking, and scaling of Dinomaly. (a) Task setting of class-separated UAD. (b) Task setting of MUAD. (c) Comparison with previous SoTA methods on MVTec-AD [bergmann2019mvtec], VisA [zou2022spot], and Real-IAD [wang2024real]. (d) Scaling properties of Dinomaly.
  • Figure 2: The framework of Dinomaly, built by simple and pure Transformer building blocks.
  • Figure 3: Softmax Attention vs. Linear Attention. (a) Visualization of attention maps. (b) Attention distribution.
  • Figure 4: Schemes of reconstruction constraint. (a) Layer-to-layer (sparse). (b) Layer-to-cat-layer. (c) Layer-to-layer (dense). (d) Loose group-to-group, 1-group (Ours). (e) Loose group-to-group, 2-group (Ours).
  • Figure 5: Image-level AUROC of Dinomaly equipped with various ViT foundations, and their linear-probing accuracy on ImageNet. MIM: Masked Image Modeling. CL: Contrastive Learning.
  • ...and 3 more figures