PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures

Yuheng Shao, Lizhang Wang, Changhao Li, Peixian Chen, Qinyuan Liu

TL;DR

PromptMoE reframes prompt learning for zero-shot anomaly detection (ZSAD) as a compositional mixture-of-experts problem. Its Visually-Guided Mixture of Prompt (VGMoP) module keeps separate normal and abnormal prompt pools and routes over them layer by layer, constructing instance-specific textual representations that generalize better to unseen anomaly patterns. The approach achieves state-of-the-art results across 15 industrial and medical datasets, aided by auxiliary losses that balance expert usage and promote diversity. The results indicate that learning how to compose prompts, rather than relying on a single monolithic prompt, makes ZSAD and anomaly localization more robust across diverse domains.
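
The summary above does not give the form of the auxiliary losses, so the following is only a sketch of two common choices, not the paper's definitions. It assumes a load-balancing term equal to the squared coefficient of variation of per-expert routing weight, and a diversity term equal to the mean squared cosine similarity between distinct expert prompts. The function names `load_balance_loss` and `diversity_loss` are hypothetical.

```python
import math

def load_balance_loss(gate_probs):
    """Squared coefficient of variation of per-expert importance (assumed form).

    gate_probs: one routing distribution per image, each a list over experts.
    Returns 0.0 when every expert receives the same total routing weight.
    """
    num_experts = len(gate_probs[0])
    importance = [sum(p[e] for p in gate_probs) for e in range(num_experts)]
    mean = sum(importance) / num_experts
    var = sum((v - mean) ** 2 for v in importance) / num_experts
    return var / (mean ** 2 + 1e-10)

def diversity_loss(experts):
    """Mean squared cosine similarity over distinct expert pairs (assumed form).

    experts: list of expert prompt vectors; lower values mean more diverse experts.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b + 1e-10)

    n = len(experts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(experts[i], experts[j]) ** 2 for i, j in pairs) / len(pairs)
```

Under these assumptions, a router that sends every image to the same expert is penalized by the first term, and a pool whose experts have collapsed to the same direction is penalized by the second.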

Abstract

Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
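
To make the idea of sparsely combining expert prompts concrete, here is a minimal sketch of top-k gating over a prompt pool. It is an illustration under stated assumptions, not the authors' implementation: the pool contents, the router logits, the softmax over the selected experts, and the helper names `top_k_gate` and `mix_prompts` are all hypothetical.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def mix_prompts(experts, weights):
    """Weighted sum of expert prompt vectors, giving one instance-specific prompt."""
    dim = len(experts[0])
    return [sum(w * e[d] for w, e in zip(weights, experts)) for d in range(dim)]

# Hypothetical pool of four "normal" expert prompts with embedding size 3.
normal_pool = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
# In the described method these scores would depend on the image; fixed here.
router_logits = [2.0, 2.0, -1.0, 0.5]

weights = top_k_gate(router_logits, k=2)
normal_prompt = mix_prompts(normal_pool, weights)
```

The same routine would be applied to a separate abnormal pool, so each image ends up with its own normal and abnormal prompt built from a small subset of experts.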

Paper Structure

This paper contains 36 sections, 8 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: (a) Monolithic Prompt vs. (b) Compositional Prompt. Unlike monolithic prompts with fixed representations, our compositional prompt employs visual querying to dynamically aggregate expert prompts, treated as semantic primitives, into an instance-specific textual representation.
  • Figure 2: ZSAD performance of $\mathtt{PromptMoE}$ compared to state-of-the-art methods. Left: I-AUROC. Right: P-PRO.
  • Figure 3: Framework of $\mathtt{PromptMoE}$.
  • Figure 4: Architecture of our proposed Visually-Guided Mixture of Prompt (VGMoP) module. A sparse router, guided by cross-attention between learnable queries and image features, dynamically selects and aggregates top-k experts from respective prompt pools to construct instance-specific normal and abnormal text prompts.
  • Figure 5: Qualitative comparison of anomaly localization results across different ZSAD methods.
  • ...and 3 more figures
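
The Figure 4 caption describes a router guided by cross-attention between learnable queries and image features. The sketch below shows one way such routing scores could be computed; it is a simplified assumption rather than the paper's architecture. It uses plain scaled dot-product attention with no learned projections, averages the attended queries, and scores each expert with a hypothetical per-expert key vector (`expert_keys`).

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, patches):
    """Each query attends over image patch features (keys = values = patches)."""
    scale = math.sqrt(len(patches[0]))
    attended = []
    for q in queries:
        attn = softmax([dot(q, p) / scale for p in patches])
        attended.append(
            [sum(a * p[d] for a, p in zip(attn, patches)) for d in range(len(q))]
        )
    return attended

def router_logits(queries, patches, expert_keys):
    """Pool the attended queries, then score every expert by dot product."""
    attended = cross_attend(queries, patches)
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    return [dot(pooled, key) for key in expert_keys]

# Toy inputs: two image patches, one query, three experts (all hypothetical).
patches = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0]]  # a zero query attends uniformly over the patches
expert_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

logits = router_logits(queries, patches, expert_keys)
```

These per-expert scores are what a sparse router would then reduce to the top-k experts for aggregation.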