UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao

TL;DR

UniMMAD addresses the fragmentation of anomaly detection (AD) by handling multi-modal and multi-class AD in a single architecture. It follows a general-to-specific paradigm: a general multi-modal encoder compresses the inputs into general-purpose features, and a Cross MoE (C-MoE) decoder, guided by domain priors, decompresses them into modality-specific and class-specific features. A MoE-in-MoE hierarchy and grouped dynamic filtering keep the decoder parameter-efficient, reducing parameter usage by 75% while keeping activation sparse and inference fast. The model reports state-of-the-art results with precise anomaly localization on nine datasets spanning industrial, synthetic, and medical scenarios, and it also supports continual learning.

Abstract

Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general to specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
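
The sparsely-gated routing described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`route_top_k`, `cross_moe`), the linear router, the softmax over the selected logits, and the additive combination with an always-active fixed expert are all assumptions made for illustration. The only ideas taken from the text are that a domain prior selects a few expert pathways and that the remaining experts stay inactive.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(prior, router_weights, k):
    """Pick the k experts with the highest router logits.

    prior: domain-prior vector (hypothetical stand-in for modality/class cues).
    router_weights: one weight row per expert (assumed linear router).
    Returns the selected expert indices and their normalized gate values.
    """
    logits = [sum(w * p for w, p in zip(row, prior)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return top, gates


def cross_moe(feature, prior, router_weights, routed_experts, fixed_expert, k=2):
    """Combine an always-active expert with k sparsely selected routed experts."""
    indices, gates = route_top_k(prior, router_weights, k)
    out = list(fixed_expert(feature))  # assumed: the fixed expert always runs
    for i, gate in zip(indices, gates):
        expert_out = routed_experts[i](feature)  # only selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, indices


# Toy usage: four routed experts that scale the feature by different factors.
experts = [lambda f, s=s: [s * x for x in f] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
output, chosen = cross_moe([1.0, -2.0], [0.0, 5.0, 0.0, 3.0], router,
                           experts, lambda f: list(f), k=2)
```

In this toy setup only two of the four routed experts are evaluated per input, which is the property that keeps activation sparse as the number of experts grows.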

Paper Structure

This paper contains 25 sections, 4 equations, 14 figures, 18 tables.

Figures (14)

  • Figure 1: (a) Existing methods (li2024multi, wang2023multimodal, costanzino2024multimodal) rely on specialized models tailored to individual modalities and classes. (b) The proposed UniMMAD model unifies multi-modal and multi-class anomaly detection tasks within a single framework. (c) Visual examples, with modalities highlighted in white, class names in yellow, and anomaly regions marked by red boxes.
  • Figure 2: (a) Overview of the fields, modalities, and classes encompassed by UniMMAD. (b) Architectures of UniMMAD and mainstream multi-class methods (you2022unified, he2024mambaad). Reconstructed feature distribution in normal regions is shown on the right. Larger distances between original features (triangles $\blacktriangledown$) and reconstructed features (circles ●) mean a higher risk of false positives.
  • Figure 3: Overview of UniMMAD. It processes various modality combinations via a general multi-modal encoder and a Feature Compression Module (FCM). The FCM comprises a hierarchical BottleNeck-$K$ structure and residual blocks (ResBlock), where BottleNeck-$K$ uses $K\times K$ convolutions to capture scale information and $1\times 1$ convolutions to adjust dimensions. A prior generator provides domain-specific priors to guide the C-MoE in decompressing general features into domain-specific ones.
  • Figure 4: Detailed architecture of C-MoE. It selects expert indices based on domain-specific priors using a condition router, activates the corresponding routed experts and a fixed expert, and decompresses general features via grouped dynamic filtering. Each routed expert adopts an MoE-in-MoE structure to improve parameter efficiency.
  • Figure 5: Qualitative comparisons across three scenes, with our method highlighted in red, modalities in white, and class names in yellow.
  • ...and 9 more figures
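
The captions for Figures 3 and 4 mention $K\times K$ and $1\times 1$ convolutions and a grouped filtering scheme without giving layer sizes. As general background only, the sketch below counts the weights of a standard 2-D convolution with and without channel grouping. The channel widths, the helper names (`conv_params`, `bottleneck_params`), and the assumed ordering of the two convolutions inside a BottleNeck-$K$ block are hypothetical, and the numbers are not meant to reproduce the 75% reduction reported in the abstract.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution (bias omitted).

    With `groups` channel groups, each output channel only sees
    c_in / groups input channels, so the count shrinks by that factor.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return c_out * (c_in // groups) * k * k


def bottleneck_params(c_in, c_mid, c_out, k):
    """Assumed layout: K x K conv (c_in -> c_mid), then 1 x 1 conv (c_mid -> c_out)."""
    return conv_params(c_in, c_mid, k) + conv_params(c_mid, c_out, 1)


# Hypothetical sizes, chosen only to make the arithmetic concrete.
dense = conv_params(64, 64, 3)              # 64 * 64 * 3 * 3
grouped = conv_params(64, 64, 3, groups=4)  # 64 * 16 * 3 * 3
block = bottleneck_params(64, 32, 64, 3)    # 64 * 32 * 9 + 32 * 64
```

The point of the arithmetic is that grouping divides the weight count of a convolution by the number of groups, while the $1\times 1$ convolution changes the channel dimension at a small cost compared with the $K\times K$ layer.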