Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
Yuanpeng Tu, Boshen Zhang, Liang Liu, Yuxi Li, Xuhai Chen, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Cai Rong Zhao
TL;DR
The paper tackles multi-modal 3D industrial anomaly detection by addressing the domain gap in transferred pretrained features. It introduces Local-to-global Self-supervised Feature Adaptation (LSFA), which jointly optimizes intra-modal feature compactness (IFC) and cross-modal local-to-global consistency (CLC) to produce task-oriented representations for RGB and 3D data. Using dynamic memory banks and multi-granularity signals, LSFA improves anomaly localization and detection, achieving a new state of the art on MVTec-3D AD with an I-AUROC of $97.1\%$ and strong results on Eyecandies. The method maintains efficiency by avoiding heavy fine-tuning and demonstrates robustness in few-shot regimes, offering practical benefits for industrial inspection systems.
Abstract
Industrial anomaly detection is generally addressed as an unsupervised task that aims at locating defects with only normal training samples. Recently, numerous 2D anomaly detection methods have been proposed and have achieved promising results; however, using only 2D RGB data as input is not sufficient to identify imperceptible geometric surface anomalies. Hence, in this work, we focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets, i.e., ImageNet, to construct feature databases. We empirically find that directly using these pre-trained models is not optimal: it can either fail to detect subtle defects or mistake abnormal features for normal ones. This may be attributed to the domain gap between target industrial data and source data. To address this problem, we propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representations for anomaly detection. Both intra-modal adaptation and cross-modal alignment are optimized from a local-to-global perspective in LSFA to ensure representation quality and consistency in the inference stage. Extensive experiments demonstrate that our method not only brings a significant performance boost to feature-embedding-based approaches, but also outperforms previous State-of-The-Art (SoTA) methods by a clear margin on both the MVTec-3D AD and Eyecandies datasets, e.g., LSFA achieves 97.1% I-AUROC on MVTec-3D, surpassing the previous SoTA by +3.4%.
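To make the "feature database" idea concrete, the following is a minimal sketch of the generic feature-embedding baseline that the abstract refers to, not of LSFA itself. It assumes features from defect-free samples are stored in a bank and that a test patch is scored by its distance to the nearest stored feature. The function names and the toy 2-D vectors are hypothetical stand-ins for real pretrained embeddings.

```python
import math

def build_memory_bank(normal_features):
    """Store feature vectors extracted from defect-free training samples."""
    return [tuple(f) for f in normal_features]

def patch_score(feature, bank):
    """Anomaly score of one patch: Euclidean distance to its nearest normal feature."""
    return min(math.dist(feature, ref) for ref in bank)

def image_score(patch_features, bank):
    """Image-level score: the largest patch score found in the image."""
    return max(patch_score(f, bank) for f in patch_features)

# Toy 2-D features standing in for pretrained-backbone embeddings (assumption).
bank = build_memory_bank([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
normal_image = [(0.1, 0.0), (0.9, 0.1)]   # every patch lies close to the bank
defect_image = [(0.1, 0.0), (3.0, 4.0)]   # one patch lies far from the bank

normal_score = image_score(normal_image, bank)
defect_score = image_score(defect_image, bank)
print(normal_score < defect_score)
```

In this picture, a domain gap means that distances in the pretrained feature space do not separate normal and defective patches well. That is the motivation the abstract gives for adapting the features before they are stored and compared.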
