Bring the Power of Diffusion Model to Defect Detection

Xuyi Yu

TL;DR

This work tackles industrial surface defect detection by injecting the high-order semantic information inherent in diffusion models into a lightweight detector. It builds a memory-efficient feature repository from pre-trained DDPM activations and compresses it with ResVAE, then fuses these features with the detector through dynamic cross-attention and FFT-based noise filtering, followed by knowledge distillation to preserve efficiency. The approach yields improved detection accuracy and segmentation performance across NEU-DET, GC10-DET, and Tianchi Fabric datasets, while maintaining or reducing inference cost. Overall, the method demonstrates a practical pathway to apply diffusion-model semantics in real-time defect detection, with strong ablation-supported evidence of component contributions and potential for broader industrial impact.

Abstract

Due to the high complexity and technical requirements of industrial production processes, surface defects inevitably appear and seriously affect product quality. Although existing lightweight detection networks are highly efficient, they are prone to false or missed detections of non-salient defects because they lack semantic information. In contrast, the diffusion model can generate higher-order semantic representations during the denoising process. The aim of this paper is therefore to incorporate the higher-order modelling capability of the diffusion model into the detection model, to better assist the classification and localization of difficult targets. First, a denoising diffusion probabilistic model (DDPM) is pre-trained, and the features of its denoising process are extracted to construct a feature repository. In particular, to avoid the potential memory bottleneck caused by the dataloader loading high-dimensional features, a residual convolutional variational auto-encoder (ResVAE) is designed to further compress the feature repository. The image is fed into both the image backbone and the feature repository, for feature extraction and querying respectively. The queried latent features are reconstructed and filtered to obtain high-dimensional DDPM features. A dynamic cross-fusion method is proposed to fully refine the contextual features of the DDPM and optimize the detection model. Finally, we employ knowledge distillation to transfer the higher-order modelling capabilities back into the lightweight baseline model without additional efficiency cost. Experimental results demonstrate that our method achieves competitive results on several industrial datasets.
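The filtering step mentioned above operates in the frequency domain. The following is a minimal sketch of that idea only, not the paper's filter: it assumes an ideal low-pass mask, applies it to a single 1D row of a hypothetical feature map, and uses a naive DFT so that it runs with the Python standard library alone. The function names and the `keep_bins` cutoff are illustrative assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n
        for k in range(n)
    ]


def low_pass_filter(signal, keep_bins):
    """Zero every frequency bin farther than keep_bins from DC (assumed ideal mask)."""
    n = len(signal)
    spectrum = dft(signal)
    for f in range(n):
        # Bins f and n - f carry the same absolute frequency.
        if min(f, n - f) > keep_bins:
            spectrum[f] = 0
    # The input is real, so the imaginary part is only rounding error.
    return [value.real for value in idft(spectrum)]


# One row of a hypothetical feature map: a slow wave plus high-frequency "noise".
n = 32
slow = [math.sin(2 * math.pi * k / n) for k in range(n)]
fast = [0.3 * math.sin(2 * math.pi * 12 * k / n) for k in range(n)]
noisy = [s + f for s, f in zip(slow, fast)]

filtered = low_pass_filter(noisy, keep_bins=4)
max_error = max(abs(a - b) for a, b in zip(filtered, slow))
```

In this toy case the high-frequency component sits at bin 12, beyond the cutoff of 4, so the filter removes it and `filtered` matches the slow wave up to rounding error. A practical version would apply a 2D FFT to each feature channel, and the mask shape used in the paper may differ.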

Paper Structure

This paper contains 27 sections, 8 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Diffusion models smoothly perturb data by adding noise, then reverse this process to generate new data from noise.
  • Figure 2: The overall pipeline of the proposed method, which can be divided into three stages. The first stage is feature preparation, where a more condensed feature repository is constructed by unsupervised pre-training of the DDPM and the VAE on the input images and the internal features respectively; circular arrows mark components that need to be pre-trained. The second stage is cross-model fusion, where the DDPM features from the repository are fused into the detection model for enhancement. The third stage is power transfer, where the power of the hybrid model is transferred back to the baseline model.
  • Figure 3: The process of feature preparation. First, noise is added to the input image $X_0$ according to the noise schedule to obtain $X_t$. $X_t$ is then fed to the pre-trained DDPM for noise prediction, and the intermediate features of the UNet are collected in the feature repository. Finally, the feature repository is used as a data source to iteratively train and update the feature compressor (ResVAE), yielding a more compact feature repository.
  • Figure 4: The structure of cross-model fusion. The input image is fed into both the image backbone and the feature repository to extract features. The features from the feature repository pass through feature mapping to reconstruct the high-dimensional DDPM features. A subsequent noise filter removes high-frequency noise from the features in the frequency domain. Finally, dynamic cross-fusion of the features from the two models is performed.
  • Figure 5: The process of feature mapping. A fixed-parameter decoder converts latent features to high-dimensional features; the grey part is not involved in the forward process.
  • ...and 13 more figures
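
The noising step described in the captions for Figures 1 and 3 (adding noise to $X_0$ according to a noise schedule to obtain $X_t$) can be sketched with the standard DDPM closed form $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. This is a minimal illustration under stated assumptions: the linear schedule, its endpoints, the number of steps, and all function names are assumed here and are not taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not the paper's)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta), usually written alpha-bar."""
    alpha_bars = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 at step index t (0-based)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# A flattened toy "image" with pixel values scaled to [-1, 1].
x0 = [-1.0, -0.5, 0.0, 0.5, 1.0]
t = 500
xt, noise = add_noise(x0, t, alpha_bars, rng)

# Given the exact noise, x_0 can be recovered from x_t; a DDPM is trained
# to predict this noise, which is the "noise prediction" in Figure 3.
scale = math.sqrt(alpha_bars[t])
recovered = [
    (x - math.sqrt(1.0 - alpha_bars[t]) * e) / scale for x, e in zip(xt, noise)
]
```

The cumulative product shrinks as the step index grows, so later steps keep less of the original image and more noise. Which timesteps and which UNet layers feed the feature repository is a choice made in the paper that this sketch does not model.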