MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

Jun Yeong Park, JunYoung Seo, Minji Kang, Yu Rang Park

TL;DR

A Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics.

Abstract

The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
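The patch-level routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The linear router, the top-k selection with renormalized weights, the dimensions, and every name (`top_k_route`, `moe_lora_patch`, `W_router`, `scale`) are assumptions made for illustration, and FOFS and the ETF loss are left out. Plain Python is used so that the example runs without a tensor library.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def top_k_route(logits, k):
    """Return (expert indices, renormalized weights) for one patch."""
    probs = softmax(logits)
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return idx, [probs[i] / total for i in idx]

def moe_lora_patch(x, W_router, experts, k=2, scale=1.0):
    """Adapt one patch feature x with its top-k LoRA experts.

    Each expert is a pair (A, B): A is r x d (down-projection) and
    B is d x r (up-projection), so B @ A @ x is a rank-r update.
    The residual form and the value of k are assumptions.
    """
    idx, weights = top_k_route(matvec(W_router, x), k)
    out = list(x)
    for i, w in zip(idx, weights):
        A, B = experts[i]
        delta = matvec(B, matvec(A, x))
        out = [o + scale * w * d_i for o, d_i in zip(out, delta)]
    return out, idx, weights

def rand_matrix(rows, cols, std=0.1):
    """Small random matrix standing in for learned parameters."""
    return [[random.gauss(0, std) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, r, n_experts = 8, 2, 4                      # hypothetical sizes
W_router = rand_matrix(n_experts, d)
experts = [(rand_matrix(r, d), rand_matrix(d, r)) for _ in range(n_experts)]
patches = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

# Each patch is routed independently, so different patches may use different experts.
results = [moe_lora_patch(p, W_router, experts) for p in patches]
```

Because the router is evaluated per patch, two patches from the same image can receive different low-rank updates, which is the contrast with a single adapter applied uniformly to all patches.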

Paper Structure

This paper contains 46 sections, 22 equations, 35 figures, 15 tables, 1 algorithm.

Figures (35)

  • Figure 1: Comparison between existing CLIP-adapter-based methods and our MoECLIP. (a) The general CLIP-based ZSAD framework. (b) Existing methods apply a uniform adaptation to all patches, regardless of their unique characteristics. (c) In contrast, our MoECLIP utilizes a Mixture of Experts to achieve patch-specialized adaptation, dynamically routing each patch to experts that are differentiated by FOFS and an ETF loss.
  • Figure 2: The framework of MoECLIP. MoE is integrated into multiple layers of the CLIP Vision Encoder, enabling dynamic expert routing for each image patch to learn patch-specific representations for ZSAD. Within each MoE, FOFS enforces expert specialization by orthogonally separating the feature space, and the ETF loss further enhances expert diversity by maximizing the equiangular separation of expert outputs. PAA then aggregates the refined patch features across multiple scales to capture anomalies of different sizes.
  • Figure 3: Visualization of Grad-CAM and Patch Selection Map for each expert at layer 18. The Ground Truth image is shown on the far left. The top row (Grad-CAM) highlights each expert's focus region. The bottom row (Patch Selection) illustrates the patches where the corresponding expert was the router's Top-1 choice (shown in green). The value in each subplot title represents the expert's average renormalized routing weight based on the Top-1 setting for its Top-1 assigned patches.
  • Figure 4: Visualization of Anomaly Maps comparing MoECLIP with previous ZSAD methods across industrial and medical domains. The first column shows the Ground Truth, and the remaining columns show anomaly maps from each method.
  • Figure 5: Ablation Study on Inter-Expert similarity heatmap at layer 18. The heatmap shows the average pairwise cosine similarity between expert features, computed on the MVTec-AD test set. Values approaching +1 (red) indicate high redundancy, while values approaching 0 (white) or negative values (blue) signify successful differentiation.
  • ...and 30 more figures
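
The inter-expert similarity heatmap described in the Figure 5 caption can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' evaluation code: the averaging over patches, the data layout `expert_feats[e][p]`, and the function names are hypothetical. As a reference point, a simplex ETF with K vectors has pairwise cosine similarity exactly -1/(K-1), which is the kind of maximally equiangular arrangement the ETF loss is described as targeting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(expert_feats):
    """Average pairwise cosine similarity between expert features.

    expert_feats[e][p] is expert e's feature vector for patch p
    (an assumed layout). Returns an E x E matrix whose (i, j) entry
    is the cosine similarity of experts i and j averaged over patches.
    """
    n_e = len(expert_feats)
    n_p = len(expert_feats[0])
    return [
        [
            sum(cosine(expert_feats[i][p], expert_feats[j][p]) for p in range(n_p)) / n_p
            for j in range(n_e)
        ]
        for i in range(n_e)
    ]

def simplex_etf(k):
    """k unit vectors in R^k whose pairwise cosine is -1/(k-1)."""
    s = math.sqrt(k / (k - 1))
    return [
        [s * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

K = 4                                         # hypothetical number of experts
etf = simplex_etf(K)
# Toy input: each expert outputs its ETF direction on 3 patches.
feats = [[etf[e] for _ in range(3)] for e in range(K)]
sim = similarity_matrix(feats)
```

On this toy input the diagonal of `sim` is 1 and every off-diagonal entry equals -1/(K-1); redundant experts would instead show off-diagonal values near +1.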