MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

Xurui Li, Feng Xue, Yu Zhou

TL;DR

MuSc-V2 tackles zero-shot industrial anomaly classification and segmentation on multimodal data by exploiting the abundant correspondences among normal patches across unlabeled samples, in both 2D and 3D. It introduces Iterative Point Grouping (IPG) for robust 3D grouping, Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) for multi-scale neighborhood aggregation with Similarity-Weighted Pooling (SWPooling), and a Mutual Scoring Mechanism (MSM) complemented by Cross-modal Anomaly Enhancement (CAE) and Re-scoring with Constrained Neighborhood (RsCon). Together these achieve training-free, cross-sample anomaly scoring and robust segmentation. The approach substantially improves over state-of-the-art zero-shot methods (+23.7% AP on MVTec 3D-AD and +19.3% AP on Eyecandies) and also raises AUROC for anomaly classification, while remaining robust on smaller subsets and under varying ratios of normal samples. These results suggest strong practical potential for deployment on production lines without labeled data or task-specific prompts, offering scalable multimodal anomaly detection and localization across diverse industrial contexts.

Abstract

Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that existing methods overlook: normal image patches across industrial products typically find many similar patches in other samples, not only in 2D appearance but also in 3D shape, whereas anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS that flexibly supports a single 2D or 3D modality as well as multimodal input. Specifically, our method first improves the 3D representation through Iterative Point Grouping (IPG), which reduces false positives caused by discontinuous surfaces. Then, Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) fuses 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core of the framework comprises a Mutual Scoring Mechanism (MSM), which lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE), which fuses 2D and 3D scores to recover anomalies that one modality alone misses. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classifications based on similarity to more representative samples. The framework works on both the full dataset and smaller subsets with consistently robust performance, so it adapts readily to diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% gain on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
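
The mutual scoring idea can be illustrated with a minimal NumPy sketch. It assumes each sample is already encoded as P patch features of dimension C (for example by a frozen backbone), scores every patch of a sample by its nearest-neighbor distance to the patches of each other unlabeled sample, and then averages only the lowest fraction of those per-sample scores, in the spirit of the Interval Average (IA) operation named in Figure 5. The function name `mutual_scores`, the L2-normalized Euclidean distance, and the 30% interval are illustrative assumptions, not the authors' implementation; SNAMD, CAE, and RsCon are omitted.

```python
import numpy as np


def mutual_scores(features: np.ndarray, interval: float = 0.3) -> np.ndarray:
    """Hypothetical sketch of mutual scoring with an interval average.

    features: array of shape (N, P, C), i.e. N unlabeled samples, each with
    P patch features of dimension C. Returns an (N, P) array of patch
    anomaly scores (higher = more anomalous).
    """
    n, p, _ = features.shape
    if n < 2:
        raise ValueError("mutual scoring needs at least two samples")
    # Assumption: L2-normalize features so distances reflect direction only.
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    # Number of lowest per-sample scores kept by the interval average.
    k = max(1, int(np.ceil(interval * (n - 1))))
    scores = np.empty((n, p))
    for i in range(n):
        per_ref = []
        for j in range(n):
            if j == i:
                continue  # a sample never scores itself
            # (P, P) Euclidean distances between patches of samples i and j.
            d = np.linalg.norm(f[i][:, None, :] - f[j][None, :, :], axis=-1)
            # Each patch of sample i is scored by its closest patch in sample j.
            per_ref.append(d.min(axis=1))
        # (N-1, P): scores that every other sample assigns to sample i, sorted.
        a = np.sort(np.stack(per_ref), axis=0)
        # Interval average: mean of the k smallest scores for each patch.
        scores[i] = a[:k].mean(axis=0)
    return scores
```

Averaging only the smallest distances means a normal patch needs close matches in just a subset of the other samples to receive a low score, which is one way to stay robust when some of those samples are themselves defective. An isolated anomalous patch, by contrast, finds no close match in any other sample and keeps a high score.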

Paper Structure

This paper contains 30 sections, 13 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: (a) Zero-shot AC/AS methods for the 2D modality. (b) Zero-shot AC/AS methods for the 3D modality. These CLIP-based methods require additional text prompts and fine-tuning on auxiliary industrial datasets. (c) Our MuSc-V2 is the first multimodal zero-shot method that requires no prompts or training.
  • Figure 2: The pipeline of our MuSc-V2. The framework processes 2D images and 3D point clouds through four key innovations: (1) IPG replaces the existing grouping strategy in the point transformer to generate groups with continuous surfaces (Sec. \ref{sec:2d_feature}). (2) SNAMD improves the modeling of anomalies of varying sizes in both modalities (Sec. \ref{sec:3d_feature}). (3) MSM produces anomaly segmentation results for the 2D and 3D modalities, and CAE enhances anomaly scores when both modalities are available (Sec. \ref{sec:m3sm}). (4) RsCon reduces false anomaly classifications caused by local noise and weak anomalies (Sec. \ref{sec:rscon}).
  • Figure 3: Toy example of searching for $K_\textbf{P}$ points around the center point $p_\textbf{c}$. The green lines and regions represent candidate points, and the blue ones indicate the points selected as the group of $p_\textbf{c}$.
  • Figure 4: Similarity-Weighted Pooling (SWPooling) versus Average Pooling (APooling). Top: a toy example of feature maps aggregated by the two methods, where blue and red patches simulate normal and abnormal tokens, respectively. Bottom: segmentation results with SWPooling and APooling on a real example.
  • Figure 5: (a-b) Score distributions $A_\textbf{I}^{i,s,m}$ for normal/abnormal 2D patches. (c-d) Corresponding score distributions for 3D patches. (e-f) Comparison of $\overline{a}_\textbf{I}^{i,s,m}$ distributions without/with the Interval Average (IA) operation.
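
Figure 4 contrasts similarity-weighted pooling with plain average pooling. The sketch below is one plausible reading of that contrast, not the paper's definition of SWPooling. Each position of an H×W×C feature map is replaced by a weighted mean of its r×r neighborhood (r assumed odd). The weights come from a softmax over cosine similarity to the center token with temperature `tau`, whereas average pooling uses uniform weights. The function name `neighborhood_pool`, the softmax weighting, the temperature, and the edge padding are all assumptions for illustration.

```python
import numpy as np


def neighborhood_pool(fmap: np.ndarray, r: int = 3,
                      similarity_weighted: bool = True,
                      tau: float = 0.1) -> np.ndarray:
    """Hypothetical sketch: pool each token of an (H, W, C) map over its r x r window.

    similarity_weighted=True  -> weights = softmax(cosine(neighbor, center) / tau)
    similarity_weighted=False -> uniform weights, i.e. plain average pooling
    """
    fmap = np.asarray(fmap, dtype=float)
    h, w, c = fmap.shape
    pad = r // 2
    # Assumption: replicate border tokens so every window has r*r entries.
    padded = np.pad(fmap, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(fmap)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + r, x:x + r].reshape(-1, c)  # (r*r, C)
            center = fmap[y, x]
            if similarity_weighted:
                sim = window @ center / (
                    np.linalg.norm(window, axis=1) * np.linalg.norm(center) + 1e-8)
                # Numerically stable softmax over the window.
                wts = np.exp((sim - sim.max()) / tau)
                wts /= wts.sum()
            else:
                wts = np.full(window.shape[0], 1.0 / window.shape[0])
            out[y, x] = wts @ window  # weighted mean of neighbor features
    return out
```

Consider a toy map of normal tokens with a single anomalous token in the middle. With a 3×3 window, average pooling dilutes the anomalous channel to 1/9 at that position and leaks the same 1/9 into each normal neighbor. The similarity-weighted variant keeps the anomalous token nearly intact and its normal neighbors nearly clean, which matches the qualitative behavior Figure 4 describes.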