A Survey on Foundation-Model-Based Industrial Defect Detection
Tianle Yang, Luyao Chang, Jiadong Yan, Juntao Li, Zhi Wang, Ke Zhang
TL;DR
This survey analyzes foundation-model-based approaches to industrial defect detection in 2D and 3D. It highlights how foundation models (FMs) such as SAM, CLIP, and GPT enable few-shot and zero-shot detection through cross-modal prior knowledge, while non-foundation-model (NFM) methods remain valuable for efficiency and data-sparse contexts. It contrasts FM and NFM methods along training objectives, architectures, scale, and performance, noting that FMs handle data scarcity better but carry higher computational demands. The FM section catalogues 2D SAM-based, 2D CLIP-based, 2D GPT-based, and 3D CLIP-based methods, detailing how prompts, fine-grained alignment, and cross-modal fusion drive defect localization and interpretation. The NFM section surveys statistics-based methods, anomaly synthesis, 2D+3D fusion, and 3D generation techniques, offering insights that can inform FM development. Overall, FM methods excel in few-shot and zero-shot scenarios and in cross-domain applicability, but challenges remain in inference speed and 3D performance, motivating hybrid and synthetic-data strategies to close the gap to practical deployment.
Abstract
As industrial products become more abundant and sophisticated, visual industrial defect detection, which covers both two-dimensional and three-dimensional visual feature modeling, has received growing attention. Traditional methods use statistical analysis, abnormal data synthesis modeling, and generation-based models to separate product defect features and complete defect detection. Recently, the emergence of foundation models has brought visual and textual semantic prior knowledge. Many methods build on foundation models (FMs) to improve detection accuracy, but at the cost of greater model complexity and slower inference. Some FM-based methods have begun to explore lightweight modeling approaches, which have gradually attracted attention and deserve systematic analysis. In this paper, we conduct a systematic survey that compares and discusses foundation model methods from different aspects, and we briefly review recently published non-foundation-model (NFM) methods. Furthermore, we discuss the differences between FM and NFM methods in terms of training objectives, model structure and scale, and model performance, and we outline potential directions for future exploration. Through this comparison, we find that FM methods are better suited to few-shot and zero-shot learning, which is more in line with actual industrial application scenarios and worthy of in-depth research.
