Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, Hongen Liao
TL;DR
Dinomaly tackles multi-class unsupervised anomaly detection (MUAD) with a minimalistic, transformer-only framework built from four components: Foundation Transformers, a Noisy Bottleneck, Linear Attention, and Loose Reconstruction. Using a pretrained ViT encoder and an 8-layer decoder, it reconstructs mid-level features and detects anomalies via encoder–decoder discrepancies; dropout-based noise injection and relaxed reconstruction constraints keep the decoder from learning an identity mapping. Across MVTec-AD, VisA, and Real-IAD, Dinomaly achieves state-of-the-art MUAD performance and scales effectively with model size and input resolution, while remaining competitive with class-separated models. This work demonstrates that a simple, universal transformer-based approach can close much of the gap between MUAD and specialized per-class models, offering practical scalability and broad applicability in real-world anomaly detection tasks.
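To make "detects anomalies via encoder–decoder discrepancies" concrete, here is a minimal sketch in plain Python. It assumes the discrepancy is measured as a per-location cosine distance and that the image-level score is the maximum of the resulting map; both choices, along with the names cosine_distance, anomaly_map, and image_score, are illustrative assumptions and not taken from the text above. The toy feature grids stand in for real encoder and decoder outputs.

```python
import math

def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def anomaly_map(enc_feats, dec_feats):
    """Per-location discrepancy between encoder and decoder feature grids.

    Both inputs are H x W grids of feature vectors (nested lists).
    """
    return [
        [cosine_distance(e, d) for e, d in zip(enc_row, dec_row)]
        for enc_row, dec_row in zip(enc_feats, dec_feats)
    ]

def image_score(amap):
    """Assumed image-level score: the largest value in the anomaly map."""
    return max(max(row) for row in amap)

# Toy 2 x 2 grid of 3-dimensional features (hypothetical values).
enc = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
       [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
# The decoder reproduces three locations and fails on the last one.
dec = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
       [[0.0, 0.0, 1.0], [1.0, -1.0, 0.0]]]

amap = anomaly_map(enc, dec)
score = image_score(amap)
```

In this toy case the three well-reconstructed locations score close to zero and the last location scores close to one, so the image-level score is driven by the single poorly reconstructed location.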
Abstract
Recent studies have highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Within this powerful framework, which consists only of Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which not only surpasses state-of-the-art multi-class UAD methods but also attains the most advanced class-separated UAD records.
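The remark that Linear Attention "naturally cannot focus" can be illustrated with a toy comparison of attention weights. The sketch below assumes a common linear-attention formulation with the feature map phi(x) = elu(x) + 1; the abstract does not specify the variant used, so this choice and the names softmax_weights, phi, and linear_weights are illustrative assumptions. Practical linear attention never forms explicit weights (it computes phi(Q)(phi(K)^T V) directly); the weights are written out here only to show how spread out they are.

```python
import math

def softmax_weights(q, keys):
    """Standard scaled dot-product attention weights for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phi(x):
    """Assumed positive feature map: elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_weights(q, keys):
    """Implicit attention weights of kernelized (linear) attention."""
    fq = phi(q)
    sims = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
    total = sum(sims)
    return [s / total for s in sims]

# One query and three keys: aligned, orthogonal, and opposite (toy values).
query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]

soft = softmax_weights(query, keys)
lin = linear_weights(query, keys)
```

With these toy values, softmax attention puts almost all of its weight on the aligned key, while the linear variant assigns it roughly 0.70 and leaves a noticeable share on the orthogonal key. This diffuse weighting is one way to read "cannot focus": the decoder is less able to copy each token from itself and its nearest neighbors.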
