A Survey on Foundation-Model-Based Industrial Defect Detection

Tianle Yang, Luyao Chang, Jiadong Yan, Juntao Li, Zhi Wang, Ke Zhang

TL;DR

This survey analyzes foundation-model-based approaches to industrial defect detection in 2D and 3D, highlighting how foundation models (FMs) such as SAM, CLIP, and GPT enable few-shot and zero-shot detection through cross-modal prior knowledge, while non-foundation models (NFMs) remain valuable for efficiency and data-sparse contexts. It contrasts FM and NFM methods along training objectives, architectures, scale, and performance, noting that FM methods handle data scarcity better but demand more computation. The FM section catalogues 2D SAM-based, 2D CLIP-based, 2D GPT-based, and 3D CLIP-based methods, detailing how prompts, fine-grained alignment, and cross-modal fusion drive defect localization and interpretation. The NFM section surveys statistics-based methods, anomaly synthesis, 2D+3D fusion, and 3D generation techniques, offering insights that can inform FM development. Overall, FM methods excel in few-shot and zero-shot scenarios and in cross-domain applicability, but challenges remain in inference speed and 3D performance, motivating hybrid and synthetic-data strategies to close the gap to practical deployment.

Abstract

As industrial products become more abundant and sophisticated, visual industrial defect detection, covering both two-dimensional and three-dimensional visual feature modeling, has received much attention. Traditional methods use statistical analysis, abnormal data synthesis, and generation-based models to separate defect features and complete defect detection. Recently, foundation models (FMs) have brought visual and textual semantic prior knowledge. Many methods build on FMs to improve detection accuracy, but at the cost of greater model complexity and slower inference. Some FM-based methods have begun to explore lightweight modeling, which has gradually attracted attention and deserves systematic analysis. In this paper, we conduct a systematic survey that compares and discusses foundation model methods from different aspects and briefly reviews recently published non-foundation model (NFM) methods. Furthermore, we discuss the differences between FM and NFM methods in terms of training objectives, model structure and scale, and model performance, and outline potential directions for future exploration. Through comparison, we find that FM methods are better suited to few-shot and zero-shot learning, which is more in line with actual industrial application scenarios and worthy of in-depth research.

Paper Structure

This paper contains 26 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Organization of surveyed methods. We categorize the methods under investigation into two main categories: foundation models and non-foundation models. Each category is further divided into 2D and 3D scenarios. The foundation-model-based methods primarily include methods based on SAM, CLIP, and GPT, while non-foundation-model-based methods are classified into statistics-based methods, synthesis-based methods, methods combining 2D RGB and 3D point clouds, and 3D generative methods. Finally, we present the latest methods collected in this survey.
  • Figure 2: A summary of the comparison between FM and NFM methods. We conduct a systematic comparison of the FM and NFM methods from the following 5 aspects: 1) Model Training Objectives. 2) Model Structure. 3) Model Scale. 4) Model Performance (AUROC Performance, Inference Time, and Computational Complexity). 5) Advantages and Challenges.
  • Figure 3: The left branch shows the framework of FM methods and the right branch shows that of NFM methods. FM methods are primarily based on foundation models such as SAM, CLIP, and GPT. During training, FM methods design appropriate loss functions to fine-tune the pre-trained foundation models, adapting them to the industrial defect detection domain. In contrast, NFM methods focus on designing task-specific models based on lightweight or specialized network architectures. Some NFM methods also design anomaly synthesis strategies to supplement training data.
  • Figure 4: Representative methods along the development of FM and NFM models. The orange box illustrates the evolution of FM methods. WinCLIP introduced prompt ensembles and multi-scale feature extraction with CLIP. Subsequently, SAA+ and AnomalyGPT incorporated SAM and GPT techniques, fostering the exploration of cross-modal approaches exemplified by ClipSAM. 3D FM methods emerged later, with CLIP3D-AD and PointAD focusing on addressing inconsistencies in multimodal data. Recently, 2D FM methods have improved inference speed and accuracy, for example STLM, which is based on a teacher-student framework, and CLIP-FSAC, which employs vision-driven textual strategies. The green box presents the progression of NFM methods. The early method Back to the Future proposed handcrafted 3D representations but suffered from low efficiency and accuracy. Diffusion-based approaches, including TransFusion, AnomalyDiffusion, and AnomalyXFusion, effectively addressed these issues. In recent years, 3D generative techniques have been explored, with efforts concentrated on enhancing computational and storage efficiency.
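
The prompt-ensemble idea mentioned in Figure 4 can be illustrated with a minimal sketch. This is not WinCLIP's implementation: the hand-made 3-dimensional vectors stand in for CLIP text and image embeddings, and the function names and the temperature value of 0.07 are assumptions chosen for illustration. The sketch averages the embeddings of several "normal" prompts and several "abnormal" prompts into two prototypes, then scores an image by a two-way softmax over its cosine similarities to them.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_prototype(embeddings):
    """Average several normalized prompt embeddings into one unit-length prototype."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

def anomaly_score(image_emb, normal_text_embs, abnormal_text_embs, temperature=0.07):
    """Return the softmax probability of the 'abnormal' prototype (between 0 and 1)."""
    img = l2_normalize(image_emb)
    normal_proto = ensemble_prototype(normal_text_embs)
    abnormal_proto = ensemble_prototype(abnormal_text_embs)
    # Cosine similarity reduces to a dot product for unit vectors.
    sim_normal = sum(a * b for a, b in zip(img, normal_proto)) / temperature
    sim_abnormal = sum(a * b for a, b in zip(img, abnormal_proto)) / temperature
    # Numerically stable two-class softmax.
    top = max(sim_normal, sim_abnormal)
    exp_normal = math.exp(sim_normal - top)
    exp_abnormal = math.exp(sim_abnormal - top)
    return exp_abnormal / (exp_normal + exp_abnormal)

# Toy stand-ins for text embeddings of prompts such as "a photo of a flawless part"
# and "a photo of a damaged part" (hypothetical values, not real CLIP outputs).
normal_prompts = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
abnormal_prompts = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

score_good_part = anomaly_score([0.95, 0.05, 0.0], normal_prompts, abnormal_prompts)
score_bad_part = anomaly_score([0.05, 0.95, 0.0], normal_prompts, abnormal_prompts)
print(score_good_part, score_bad_part)
```

In a real pipeline the toy vectors would be replaced by the outputs of a text encoder and an image encoder, and the same scoring could be applied per image patch to obtain a localization map.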