UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

Yuan Zhao; Youwei Pang; Lihe Zhang; Hanqi Liu; Jiaming Zuo; Huchuan Lu; Xiaoqi Zhao

TL;DR

UniMMAD tackles the fragmentation of anomaly detection (AD) by unifying multi-modal and multi-class AD within a single architecture. It introduces a general-to-specific decomposition powered by a general multi-modal encoder and a Cross MoE (C-MoE) decoder, augmented with a MoE-in-MoE hierarchy and grouped dynamic filtering for parameter efficiency. The model achieves state-of-the-art results across nine diverse datasets covering industrial, synthetic, and medical scenarios, while enabling continual learning and fast inference. By leveraging domain priors to guide adaptive decompression, UniMMAD provides precise localization and robust performance across varied modalities and classes, reducing memory footprint without sacrificing accuracy.

Abstract

Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general to specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
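
To make the idea of sparse, domain-conditioned expert selection concrete, the following is a minimal sketch of generic top-k MoE gating in pure Python. It is not the authors' C-MoE: the names `top_k_gate`, `moe_forward`, and `make_scale_expert`, the one-hot domain prior, the linear gate over the concatenated feature and prior, and the toy scaling experts are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Keep the k highest-scoring experts and renormalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def make_scale_expert(scale):
    """Hypothetical toy expert: multiplies every feature by a constant."""
    return lambda feature: [scale * v for v in feature]


def moe_forward(feature, prior, gate_weights, experts, k=2):
    """Route a general feature through k experts chosen from feature + prior.

    Assumption: the gate is a single linear layer over the concatenation of
    the general feature and a domain prior (e.g. a one-hot modality/class code).
    """
    gate_in = feature + prior  # list concatenation
    logits = [sum(w * v for w, v in zip(row, gate_in)) for row in gate_weights]
    routing = top_k_gate(logits, k)
    out = [0.0] * len(feature)
    for idx, weight in routing.items():  # only the selected experts are run
        expert_out = experts[idx](feature)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, routing


experts = [make_scale_expert(s) for s in (1.0, 2.0, 3.0)]
# One row per expert; the last two columns respond to the domain prior.
gate_weights = [
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 1.0, 1.0],
]
feature = [1.0, 2.0]
_, routing_a = moe_forward(feature, [1.0, 0.0], gate_weights, experts)
_, routing_b = moe_forward(feature, [0.0, 1.0], gate_weights, experts)
print(sorted(routing_a), sorted(routing_b))
```

In this toy setup the same general feature is routed to different expert subsets depending only on the domain prior, and unselected experts are never evaluated, which is the property that keeps activation sparse as the number of experts grows.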
