Bring the Power of Diffusion Model to Defect Detection
Xuyi Yu
TL;DR
This work tackles industrial surface defect detection by injecting the high-order semantic information inherent in diffusion models into a lightweight detector. It builds a feature repository from pre-trained DDPM activations and compresses it with a residual convolutional VAE (ResVAE) to keep memory use manageable, then fuses these features with the detector through dynamic cross-attention and FFT-based noise filtering, followed by knowledge distillation to preserve efficiency. The approach improves detection accuracy and segmentation performance on the NEU-DET, GC10-DET, and Tianchi Fabric datasets while maintaining or reducing inference cost. Overall, the method demonstrates a practical pathway for applying diffusion-model semantics to real-time defect detection, with ablation studies supporting the contribution of each component.
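The FFT-based noise filtering mentioned above can be illustrated with a generic frequency-domain low-pass filter. This is only a minimal sketch, assuming a single 2D feature map and a circular mask; the function name `fft_lowpass` and the `keep_ratio` parameter are hypothetical and are not taken from the paper, whose actual filter design may differ.

```python
import numpy as np

def fft_lowpass(feature_map, keep_ratio=0.25):
    """Keep only low spatial frequencies of a 2D feature map (hypothetical helper)."""
    h, w = feature_map.shape
    # Move the zero-frequency (DC) component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    # Circular mask around the centre; its radius is an assumed hyperparameter.
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    radius = keep_ratio * min(h, w) / 2
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    # Zero out high frequencies, then transform back to the spatial domain.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

# Toy input: a constant map plus checkerboard (highest-frequency) noise.
ii, jj = np.indices((8, 8))
noisy = 2.0 + 0.5 * (-1.0) ** (ii + jj)
clean = fft_lowpass(noisy)
```

In this toy case the checkerboard component sits at the highest representable frequency, so the mask removes it and only the constant background remains.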
Abstract
Industrial production processes are complex and technically demanding, so surface defects inevitably appear and seriously affect product quality. Although existing lightweight detection networks are highly efficient, they are prone to false or missed detections of non-salient defects because they lack semantic information. In contrast, the diffusion model can generate higher-order semantic representations during the denoising process. This paper therefore aims to incorporate the higher-order modelling capability of the diffusion model into the detection model, so as to better assist in the classification and localization of difficult targets. First, a denoising diffusion probabilistic model (DDPM) is pre-trained, and the features extracted from its denoising process are used to construct a feature repository. To avoid the potential memory bottleneck caused by the dataloader loading high-dimensional features, a residual convolutional variational auto-encoder (ResVAE) is designed to further compress the feature repository. Each image is fed into both the image backbone, for feature extraction, and the feature repository, for querying. The queried latent features are reconstructed and filtered to obtain high-dimensional DDPM features. A dynamic cross-fusion method is proposed to fully refine the contextual features of the DDPM and optimize the detection model. Finally, we employ knowledge distillation to migrate the higher-order modelling capabilities back into the lightweight baseline model without additional efficiency cost. Experimental results demonstrate that our method achieves competitive results on several industrial datasets.
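The abstract does not state which distillation loss is used. As a rough illustration of how a teacher's knowledge can be transferred to a lightweight student at no inference cost, the sketch below shows the classic soft-label formulation (temperature-scaled KL divergence between teacher and student class distributions). The function names, the temperature value, and the use of classification logits are all assumptions for illustration, not the paper's method.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: assumes moderate logits so no probability
    underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    return temperature ** 2 * kl

# The loss vanishes when the student matches the teacher, and is positive otherwise.
teacher = [2.0, 0.5, -1.0]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss([0.0, 0.0, 0.0], teacher)
```

Because the loss is applied only during training, the student keeps its original architecture at inference time, which is consistent with the abstract's claim of no additional efficiency cost.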
