WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

Peng Chen; Chao Huang

TL;DR

A variational autoencoder models global semantic representations and integrates them into the prompts, improving adaptability to diverse anomaly patterns. A semantic-aware mixture-of-experts module then aggregates contextual information.

Abstract

Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
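To make the "multi-frequency image features" idea concrete, here is a minimal sketch of a single-level 2D Haar wavelet decomposition in plain Python. The abstract does not say which wavelet family or how many decomposition levels the authors use, so the Haar filter, the function name `haar_dwt2`, and the sub-band names are assumptions chosen purely for illustration; this is not the paper's implementation.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands (approx, col_diff, row_diff, diag), each half the
    size of the input. Hypothetical helper for illustration only.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")

    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        bands = ([], [], [], [])
        for c in range(0, cols, 2):
            # 2x2 block: a b
            #            d e
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2.0)  # low-frequency average
            bands[1].append((a - b + d - e) / 2.0)  # change across columns
            bands[2].append((a + b - d - e) / 2.0)  # change across rows
            bands[3].append((a - b - d + e) / 2.0)  # diagonal change
        approx.append(bands[0])
        col_diff.append(bands[1])
        row_diff.append(bands[2])
        diag.append(bands[3])
    return approx, col_diff, row_diff, diag


# A flat 4x4 patch with one bright pixel standing in for a small defect.
patch = [[1.0] * 4 for _ in range(4)]
patch[0][0] = 5.0
approx, col_diff, row_diff, diag = haar_dwt2(patch)
# The detail bands are zero everywhere except the block containing the defect.
```

On a flat region the three detail bands are zero, so any non-zero detail coefficient marks a local change. That is one intuition for why frequency-domain features could help with subtle anomalies that spatial-domain features miss.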

Paper Structure (11 sections, 10 equations, 2 figures, 2 tables)

Introduction
Methodology
Class Token Distribution Sampling
Wavelet-Enhanced Cross-Modal Attention
Semantic-Aware Mixture-of-Experts
Loss Function
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusion

Figures (2)

Figure 1: Framework of WMoE-CLIP. Class Token Distribution Sampling (CTDS) uses a VAE to model global semantic features, making the text embeddings more adaptable. Wavelet-Enhanced Cross-Modal Attention (WCMA) dynamically updates the text embeddings using wavelet-based frequency features. The Semantic-Aware Mixture-of-Experts (SA-MoE) module captures rich contextual information, which makes image-level anomaly scoring more robust.
Figure 2: Comparative visualization of anomaly localization.
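The mixture-of-experts idea named in the Figure 1 caption can be sketched generically as a softmax gate that weights the outputs of several expert functions. The caption does not describe the gating function, the number of experts, or how semantic awareness enters the routing, so the dense softmax gate and every name below (`softmax`, `mixture_of_experts`, `gate_weights`) are illustrative assumptions rather than the SA-MoE design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(x, experts, gate_weights):
    """Blend expert outputs using a linear gate followed by softmax.

    x            -- input feature vector (list of floats)
    experts      -- list of callables mapping a vector to a vector
    gate_weights -- one weight vector per expert (hypothetical parameters)
    Returns the blended output vector and the gate probabilities.
    """
    # One gate score per expert: dot product of its weight vector with x.
    scores = [sum(w * v for w, v in zip(weights, x)) for weights in gate_weights]
    probs = softmax(scores)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(p * out[j] for p, out in zip(probs, outputs)) for j in range(dim)]
    return mixed, probs


# Two toy experts; zero gate weights give every expert equal weight.
experts = [lambda v: [2.0 * t for t in v], lambda v: [t + 1.0 for t in v]]
mixed, probs = mixture_of_experts([1.0, 2.0], experts, [[0.0, 0.0], [0.0, 0.0]])
```

With all-zero gate weights the gate is uniform, so the result is the plain average of the expert outputs. Learned gate weights would shift that balance depending on the input.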
