DefFiller: Mask-Conditioned Diffusion for Salient Steel Surface Defect Generation

Yichun Tai, Zhenzhen Huang, Tao Peng, Zhijiang Zhang

TL;DR

DefFiller tackles data scarcity and pixel-level control in steel surface defect generation for saliency-based detection by introducing a mask-conditioned diffusion method built on GLIGEN. It integrates an auto-encoder, a mask encoder, and gated self-attention to produce defect regions that precisely follow given masks under text prompts, trained with a denoising objective and classifier-free guidance. Evaluation on the SD-Saliency-900 dataset shows high-fidelity, mask-consistent defect samples that improve downstream detection performance, outperforming baselines in both generation quality and utility for data expansion. The approach offers a practical, scalable pathway for industrial data augmentation and layout-conditioned diffusion in defect detection.
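The classifier-free guidance step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used formulation in which the guided noise prediction is the unconditional prediction plus a scaled difference between the conditional and unconditional predictions. The function name `classifier_free_guidance` and the use of plain Python lists in place of tensors are hypothetical choices made for readability, and the paper's exact convention for the guidance scale may differ.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend two noise predictions elementwise (assumed common CFG form).

    eps_uncond: noise predicted without the conditioning (e.g. empty prompt).
    eps_cond:   noise predicted with the conditioning (text prompt and mask).
    scale:      guidance scale; 0 returns the unconditional prediction,
                1 returns the conditional one, and values above 1
                extrapolate past the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy usage with two-element "noise" vectors.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 2.0)
print(guided)
```

Under this form, a larger scale pushes samples harder toward the condition, typically at some cost in sample diversity.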

Abstract

Current saliency-based defect detection methods show promise in industrial settings, but the unpredictability of defects in steel production environments complicates dataset creation, hampering model performance. Existing data augmentation approaches using generative models often require pixel-level annotations, which are time-consuming and resource-intensive. To address this, we introduce DefFiller, a mask-conditioned defect generation method that leverages a layout-to-image diffusion model. DefFiller generates defect samples paired with mask conditions, eliminating the need for pixel-level annotations and enabling direct use in model training. We also develop an evaluation framework to assess the quality of generated samples and their impact on detection performance. Experimental results on the SD-Saliency-900 dataset demonstrate that DefFiller produces high-quality defect images that accurately match the provided mask conditions, significantly enhancing the performance of saliency-based defect detection models trained on the augmented dataset.

Paper Structure

This paper contains 29 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of DefFiller. During training, only the parameters in the gated self-attention layers, the mask encoder, and the downsampling network are optimized. At inference, a random noise tensor $\mathbf{z}_T$ is sampled from a standard Gaussian distribution. With guidance from the text prompt, DefFiller generates defect samples that match the mask conditions.
  • Figure 2: The architecture of the mask encoder.
  • Figure 3: Illustration of mask-image pairs in the SD-Saliency-900 dataset.
  • Figure 4: Visualization of generated images with different training strategies.
  • Figure 5: Visualization of generated images with different guidance scales $\omega_{cfg}$.
  • ...and 5 more figures