M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising

Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi, Yunsheng Wu, Lizhuang Ma, Jiangning Zhang

TL;DR

This work tackles RGB-3D industrial anomaly detection under realistic noisy training data by introducing M3DM-NR, a three-stage framework built on CLIP and Point-BIND to enable robust cross-modal denoising and fusion. Stage I selects intra-modal normal references and generates suspected anomaly maps; Stage II utilizes multi-scale intra- and cross-modal comparisons with these maps to denoise training data; Stage III learns the data distribution through a hybrid fusion scheme with memory banks and OC-SVMs to perform accurate anomaly detection and segmentation. The approach achieves state-of-the-art results on MVTec-3D-AD and Eyecandies across regular and noisy settings, and extensive ablations validate the effectiveness of each module, including stage-wise denoising, multi-scale point cloud processing, and LOF-based patch filtering. The proposed framework offers a practical, robust solution for real-world industrial inspection with noisy multi-modal data and demonstrates the value of leveraging pretrained vision-language models for multimodal anomaly detection.

Abstract

Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper presents an initial study of RGB-3D multi-modal noisy anomaly detection and proposes a novel noise-resistant framework, M3DM-NR, that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages. Stage-I introduces the Suspected References Selection module, which filters a few normal samples from the training dataset using the multimodal features extracted by the Initial Feature Extraction module, and the Suspected Anomaly Map Computation module, which generates a suspected anomaly map for each reference sample so that later comparisons can focus on abnormal regions. Stage-II takes the suspected anomaly maps of the reference samples as reference, together with image, point cloud, and text information, and denoises the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.
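The final step of the Stage-II denoising can be illustrated with a small sketch. The Figure 3 caption states that the top-$\tau$ samples by anomaly score are filtered out; the code below assumes that $\tau$ is a fraction of the training set and that one scalar score per sample is already available. The function name `drop_top_tau` and the example scores are hypothetical and are not taken from the paper.

```python
import math

def drop_top_tau(scores, tau):
    """Return indices of samples kept after discarding the top-tau fraction by score.

    Assumption: tau is a fraction in [0, 1); the paper may define it differently.
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must be in [0, 1)")
    n_drop = math.floor(tau * len(scores))
    # Rank sample indices from most to least suspicious.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    # Keep the remaining samples in their original order.
    return [i for i in range(len(scores)) if i not in dropped]

# Hypothetical per-sample anomaly scores for a 10-sample training set.
scores = [0.12, 0.95, 0.30, 0.08, 0.77, 0.15, 0.22, 0.40, 0.05, 0.18]
kept = drop_top_tau(scores, tau=0.2)
print(kept)
```

With these made-up scores, the two highest-scoring samples (indices 1 and 4) are treated as suspected noise and removed, and the remaining eight samples form the denoised training set passed to Stage III.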

Paper Structure (24 sections, 22 equations, 8 figures, 26 tables)

This paper contains 24 sections, 22 equations, 8 figures, 26 tables.

Figures (8)

  • Figure 1: Top: Intuitive diagram of different task settings. Middle: Representative PatchCore [patchcore] for RGB images, our M3DM [wang2023multimodal] (conference version) for multi-modal RGB+3D data, and the new M3DM-NR for the more challenging and practical noisy setting. Bottom: Visualization results on the MVTec 3D-AD dataset [mvtec3dad]. Our M3DM-NR predicts clearly more precise anomaly regions than PatchCore+FPFH [horwitz2023back] and M3DM [wang2023multimodal].
  • Figure 2: Overall pipeline of our M3DM-NR that comprises three stages: 1) selecting intra-modal reference samples, 2) denoising the dataset by comparing it with these samples, and 3) achieving multimodal anomaly detection through multimodal feature fusion.
  • Figure 3: Overview framework of our M3DM-NR, which contains three stages to tackle challenging noisy anomaly detection task: Stage I introduces a text prompt ensemble strategy $\varphi_T$, utilizing pre-trained image encoder $E_{I}$, point cloud encoder $E_{P}$, and text encoder $E_{T}$ to extract initial features $\left\{F_{I_m}\right\}_{m=1}^M$, $\left\{F_{P_m}\right\}_{m=1}^M$, $f_{T}^{Nor}$, and $f_{T}^{Ano}$. These features are then used to select suspected reference samples $\left\{s_{R_n}\right\}_{n=1}^N$ through similarity measurement and to compute corresponding anomaly maps $\left\{W_n\right\}_{n=1}^N$. Based on the suspected samples, Stage II calculates the anomaly scores $\left\{\tilde{S}_m\right\}_{m=1}^M$ for each training sample using multi-scale and feature weighting methods, ultimately filtering out the top-$\tau$ samples to obtain a denoised training set. Stage III comprises four modules to achieve final anomaly detection and segmentation.
  • Figure 4: Visualization of Aligned Multi-Scale Point Cloud Feature Extraction (AMPCFE), which extracts local point cloud features aligned with the granularity of image patching, focusing more on local details and improving the efficacy of multi-modal anomaly detection.
  • Figure 5: Detailed Explanation of multi-scale suspected anomaly score computation, which focuses more on the patches containing anomalies and less on those without when computing the intra-modal suspected anomaly scores to enhance the accuracy of anomaly detection.
  • ...and 3 more figures
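
The reference selection described in the Figure 3 caption (picking $N$ suspected reference samples by similarity measurement against the text features $f_{T}^{Nor}$ and $f_{T}^{Ano}$) can be sketched as a CLIP-style zero-shot comparison. This is only an illustration under stated assumptions: each sample is represented by a single global feature vector, the score is a softmax over cosine similarities, and the temperature of 0.07 is a placeholder. The names `normal_probability` and `select_references` and the toy vectors are hypothetical; the paper's actual selection rule may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normal_probability(feat, f_nor, f_ano, temperature=0.07):
    """Softmax probability that a feature is closer to the 'normal' text feature.

    Assumption: the temperature value is a placeholder, not taken from the paper.
    """
    s_nor = cosine(feat, f_nor) / temperature
    s_ano = cosine(feat, f_ano) / temperature
    m = max(s_nor, s_ano)  # subtract the max for numerical stability
    e_nor = math.exp(s_nor - m)
    e_ano = math.exp(s_ano - m)
    return e_nor / (e_nor + e_ano)

def select_references(features, f_nor, f_ano, n_refs):
    """Return indices of the n_refs samples that look most 'normal'."""
    probs = [normal_probability(f, f_nor, f_ano) for f in features]
    ranked = sorted(range(len(features)), key=lambda i: probs[i], reverse=True)
    return ranked[:n_refs]

# Toy 2-D stand-ins for the text features and M = 4 training-sample features.
f_nor = [1.0, 0.0]
f_ano = [0.0, 1.0]
features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0], [0.6, 0.4]]
refs = select_references(features, f_nor, f_ano, n_refs=2)
print(refs)
```

In this toy setup, the samples at indices 2 and 0 lie closest to the "normal" direction and are selected as references, while the sample at index 1 leans toward the "anomalous" direction and would be left out.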