M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising
Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi, Yunsheng Wu, Lizhuang Ma, Jiangning Zhang
TL;DR
This work tackles RGB-3D industrial anomaly detection under realistic noisy training data by introducing M3DM-NR, a three-stage framework built on CLIP and Point-BIND to enable robust cross-modal denoising and fusion. Stage I selects intra-modal normal references and generates suspected anomaly maps; Stage II utilizes multi-scale intra- and cross-modal comparisons with these maps to denoise training data; Stage III learns the data distribution through a hybrid fusion scheme with memory banks and OC-SVMs to perform accurate anomaly detection and segmentation. The approach achieves state-of-the-art results on MVTec-3D-AD and Eyecandies across regular and noisy settings, and extensive ablations validate the effectiveness of each module, including stage-wise denoising, multi-scale point cloud processing, and LOF-based patch filtering. The proposed framework offers a practical, robust solution for real-world industrial inspection with noisy multi-modal data and demonstrates the value of leveraging pretrained vision-language models for multimodal anomaly detection.
Abstract
Existing industrial anomaly detection methods primarily focus on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and in practical scenarios datasets are seldom completely clean. To address these challenges, this paper takes a first step into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant framework, M3DM-NR, that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages. Stage-I introduces a Suspected References Selection module, which uses the multimodal features extracted by the Initial Feature Extraction module to select a few normal samples from the training dataset, and a Suspected Anomaly Map Computation module, which generates suspected anomaly maps that focus on abnormal regions and serve as references. Stage-II takes the suspected anomaly maps of the reference samples as references and, given image, point cloud, and text inputs, denoises the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in RGB-3D multi-modal noisy anomaly detection.
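The TL;DR mentions LOF-based patch filtering as one ingredient for keeping noisy patches out of what Stage III learns. As a rough illustration only, the sketch below computes the standard Local Outlier Factor over a small set of patch feature vectors and drops the patches whose score exceeds a threshold. The function names (lof_scores, filter_patches), the neighbourhood size k, the threshold of 1.5, and the toy 4-dimensional features are all assumptions made for this example; the text above does not specify how the paper configures or applies LOF.

```python
import math
import random


def lof_scores(points, k=5):
    """Local Outlier Factor of every point (plain-Python, O(n^2) sketch)."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach(p, o) = max(k_dist(o), d(p, o)).
    lrd = []
    for nb in neighbours:
        reach = [max(k_dist[j], d) for d, j in nb]
        lrd.append(len(reach) / (sum(reach) + 1e-12))
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
        for i, nb in enumerate(neighbours)
    ]


def filter_patches(points, k=5, threshold=1.5):
    """Keep only patches whose LOF score is at most the (assumed) threshold."""
    scores = lof_scores(points, k)
    return [p for p, s in zip(points, scores) if s <= threshold]


# Toy data: 60 "normal" patch features plus two far-away "noisy" ones.
random.seed(0)
normal = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)]
noisy = [[25.0] * 4, [-25.0] * 4]
features = normal + noisy
kept = filter_patches(features, k=5)
```

Scores near 1 indicate a patch whose local density matches that of its neighbours, while much larger scores indicate isolated patches. In a real pipeline the surviving features would then feed whatever memory-bank construction follows; that step is not shown here.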
