PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures
Yuheng Shao, Lizhang Wang, Changhao Li, Peixian Chen, Qinyuan Liu
TL;DR
PromptMoE reframes prompt learning for zero-shot anomaly detection (ZSAD) as a compositional, mixture-of-experts problem. Its Visually-Guided Mixture of Prompt (VGMoP) keeps separate normal and abnormal prompt pools and applies layer-wise routing to construct instance-specific textual representations that generalize better to unseen anomaly patterns. Trained with auxiliary losses that balance expert usage and promote diversity, the approach achieves state-of-the-art results across 15 industrial and medical datasets. The work shows that learning how to compose prompts, rather than relying on monolithic prompts, improves the robustness of zero-shot anomaly detection and localization across diverse domains.
Abstract
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
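To make the idea of an image-gated sparse mixture over prompt pools concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear gate per pool, top-k selection, and softmax renormalization over the selected experts, none of which are specified in the abstract. The names `route_top_k`, `mix_prompts`, and `compose_prompts` are hypothetical, each expert prompt is reduced to a single vector for brevity, and layer-wise routing and the auxiliary losses are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(image_feat, gate_weights, k):
    """Score every expert with a linear gate on the image feature,
    keep the k highest-scoring experts, and renormalize their scores.
    (Assumed gating form; the paper's exact router may differ.)"""
    logits = [sum(w * x for w, x in zip(row, image_feat)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return top, weights


def mix_prompts(pool, indices, weights):
    """Weighted sum of the selected expert prompt vectors."""
    dim = len(pool[0])
    return [sum(w * pool[i][d] for i, w in zip(indices, weights)) for d in range(dim)]


def compose_prompts(image_feat, normal_pool, abnormal_pool,
                    normal_gate, abnormal_gate, k=2):
    """Build one instance-specific normal prompt and one abnormal prompt,
    each routed independently from its own pool."""
    n_idx, n_w = route_top_k(image_feat, normal_gate, k)
    a_idx, a_w = route_top_k(image_feat, abnormal_gate, k)
    return {
        "normal": mix_prompts(normal_pool, n_idx, n_w),
        "abnormal": mix_prompts(abnormal_pool, a_idx, a_w),
    }


# Toy usage: 2-dimensional image feature, 4 experts per pool (illustrative values).
image_feat = [1.0, 0.0]
gate = [[0.5, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
prompts = compose_prompts(image_feat, pool, pool, gate, gate, k=2)
print(prompts["normal"])
```

Because only k experts contribute to each prompt, a different image feature selects a different subset of the pool, which is the sense in which the composed prompt is instance-specific in this sketch.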
