
MINet: Multi-scale Interactive Network for Real-time Salient Object Detection of Strip Steel Surface Defects

Kunye Shen, Xiaofei Zhou, Zhi Liu

TL;DR

MINet introduces a Multi-scale Interactive (MI) module that embeds multi-scale feature extraction and interaction into depthwise separable convolutions, enabling a lightweight encoder-decoder architecture for real-time salient object detection of strip steel surface defects. The MI-based backbone and the MINet decoder achieve a strong speed-accuracy trade-off: the network runs at 721 FPS on GPU with only 0.28M parameters while maintaining competitive defect-detection performance on the SD-Saliency-900 dataset. Training uses a hybrid BCE-SSIM loss with deep supervision to sharpen defect boundaries and improve overall saliency accuracy. In practice, MINet is fast enough for real-time industrial defect inspection, and the MI module is plug-and-play, so it can be reused in other visual inspection and saliency tasks. Possible improvements include multi-modal data integration and model compression.
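As an illustration of the hybrid BCE-SSIM loss with deep supervision mentioned above, here is a minimal framework-free sketch. It rests on several assumptions that are not taken from the paper: SSIM is computed from global statistics of the whole map rather than over sliding windows, the two terms are added with equal weight, and the deep-supervision total is an unweighted sum over side outputs. The function names are hypothetical.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a flattened saliency map."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def global_ssim(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-map statistics (assumption: no sliding window)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator

def hybrid_loss(pred, target):
    """BCE plus (1 - SSIM), equally weighted (assumption)."""
    return bce(pred, target) + (1.0 - global_ssim(pred, target))

def deep_supervision_loss(side_outputs, target):
    """Unweighted sum of the hybrid loss over all side outputs (assumption)."""
    return sum(hybrid_loss(pred, target) for pred in side_outputs)

# Toy flattened maps: 0 = background, 1 = defect.
target = [0.0, 0.0, 1.0, 1.0]
good = [0.1, 0.2, 0.8, 0.9]   # roughly matches the target
bad = [0.6, 0.7, 0.3, 0.4]    # roughly inverted
loss_good = hybrid_loss(good, target)
loss_bad = hybrid_loss(bad, target)
total = deep_supervision_loss([good, bad], target)
```

The BCE term penalizes per-pixel errors, while the SSIM term compares structure through means, variances, and covariance, which is why the pairing is commonly used to encourage sharper boundaries.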

Abstract

Automated surface defect detection is a fundamental task in industrial production, and existing saliency-based works cope with challenging scenes and give promising detection results. However, these cutting-edge efforts often suffer from large parameter size, heavy computational cost, and slow inference speed, which severely limits their practical application. To this end, we devise a multi-scale interactive (MI) module, which employs depthwise convolution (DWConv) and pointwise convolution (PWConv) to independently extract and interactively fuse features of different scales, respectively. In particular, the MI module can provide satisfactory characterization of defect regions with fewer parameters. Building on this module, we propose a lightweight Multi-scale Interactive Network (MINet) to conduct real-time salient object detection of strip steel surface defects. Comprehensive experimental results on the SD-Saliency-900 dataset, which contains images of three kinds of strip steel surface defects (i.e., inclusion, patches, and scratches), demonstrate that the proposed MINet achieves detection accuracy comparable to that of state-of-the-art methods while running at a GPU speed of 721 FPS and a CPU speed of 6.3 FPS for 368×368 images with only 0.28M parameters. The code is available at https://github.com/Kunye-Shen/MINet.
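To make the "fewer parameters" claim concrete, the following sketch counts the weights of a standard 3×3 convolution and of a depthwise 3×3 convolution followed by a pointwise 1×1 convolution. The channel width of 64 is an arbitrary illustrative choice, not a value from the paper, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def depthwise_conv_params(c, k):
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return c * k * k

def pointwise_conv_params(c_in, c_out):
    """Weights in a 1 x 1 (pointwise) convolution."""
    return c_in * c_out

channels = 64  # illustrative width (assumption, not from the paper)
kernel = 3

standard = standard_conv_params(channels, channels, kernel)
separable = (depthwise_conv_params(channels, kernel)
             + pointwise_conv_params(channels, channels))
ratio = standard / separable

print(standard, separable, round(ratio, 2))  # 36864 4672 7.89
```

At this width the depthwise-plus-pointwise pair needs roughly one eighth of the weights of the standard convolution, and the gap widens as the channel count grows.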

Paper Structure (29 sections, 9 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Visual comparison between our MINet and 3 state-of-the-art strip steel surface defect detection methods. Parameters(M), FLOPs(G), and Speed(FPS) are marked in red, green, and blue, respectively.
  • Figure 2: Illustration of three types of the fusion of multi-scale features, including (a) Concatenation, (b) Summation, and (c) Summation with other operations.
  • Figure 3: Illustration of the proposed MI module. In the first stage, the input feature $\mathbf{F}_{in}\in \mathbb{R}^{c\times h\times w}$ is processed by four DWConvs ($f_{DW}\in \mathbb{R}^{c\times 3\times 3}$) to acquire multi-scale features $\{\mathbf{F}_{i}^{M}\}_{i=1}^{4}\in \mathbb{R}^{c\times h\times w}$. Next, a PWConv ($f_{PW}\in \mathbb{R}^{4\times 1\times 1}$) fuses the four scales to produce enhanced multi-scale features $\{\mathbf{F}_{i}^{EM}\}_{i=1}^{c}\in \mathbb{R}^{1\times h\times w}$. A second PWConv ($f_{PW}\in \mathbb{R}^{c\times 1\times 1}$) then fuses the enhanced multi-scale features across channels. Finally, a residual connection yields the output $\mathbf{F}_{out}\in \mathbb{R}^{c\times h\times w}$ of the MI module.
  • Figure 4: Architecture of the MI-based real-time backbone. Our MI-based real-time backbone comprises five stages.
  • Figure 5: Overall architecture of our MINet. $\{\mathbf{F}_i\}_{i=1}^5$ and $\{\mathbf{F}_i^D\}_{i=1}^5$ represent the outputs of Encoder-$i$ and Decoder-$i$, respectively.
  • ...and 2 more figures
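
The data flow described in the Figure 3 caption can be traced with a small framework-free sketch. Several details are assumptions, since the caption does not specify them: the four DWConv branches are given dilation rates 1 to 4 to produce different scales, the $4\times 1\times 1$ PWConv uses a separate set of four weights per channel, and normalization and activation layers are omitted. The names `dwconv3x3` and `mi_module` are hypothetical and do not come from the authors' repository.

```python
def dwconv3x3(x, kernels, dilation):
    """Depthwise 3x3 convolution with zero padding.
    x is [c][h][w]; kernels is [c][3][3], one filter per channel."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for ki in range(3):
                    for kj in range(3):
                        ii = i + (ki - 1) * dilation
                        jj = j + (kj - 1) * dilation
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[ch][ki][kj] * x[ch][ii][jj]
                out[ch][i][j] = acc
    return out

def mi_module(x, dw_kernels, dilations, scale_w, channel_w):
    """Sketch of the MI module: DWConv branches, scale fusion,
    channel fusion, residual connection."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Stage 1: one depthwise branch per scale, each output is [c][h][w].
    multi = [dwconv3x3(x, k, d) for k, d in zip(dw_kernels, dilations)]
    # Stage 2: for each channel, fuse the scales with a 1x1 convolution
    # (assumption: per-channel weights scale_w[ch][s]).
    enhanced = [[[sum(scale_w[ch][s] * multi[s][ch][i][j]
                      for s in range(len(multi)))
                  for j in range(w)] for i in range(h)] for ch in range(c)]
    # Stage 3: fuse across channels with a 1x1 convolution, channel_w is [c][c].
    fused = [[[sum(channel_w[o][ch] * enhanced[ch][i][j] for ch in range(c))
               for j in range(w)] for i in range(h)] for o in range(c)]
    # Residual connection: add the input back.
    return [[[x[o][i][j] + fused[o][i][j] for j in range(w)]
             for i in range(h)] for o in range(c)]

# Toy input with c=2 channels on a 3x3 grid.
c, h, w = 2, 3, 3
x = [[[float(ch * 10 + i * w + j) for j in range(w)]
      for i in range(h)] for ch in range(c)]
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
dw_kernels = [[identity_kernel for _ in range(c)] for _ in range(4)]
scale_w = [[0.25] * 4 for _ in range(c)]  # average the four scales
channel_w = [[1.0 if o == ch else 0.0 for ch in range(c)] for o in range(c)]

# With these identity-like weights the module reduces to out = 2 * x.
out = mi_module(x, dw_kernels, [1, 2, 3, 4], scale_w, channel_w)
```

Under these assumptions the learnable weights amount to $4\cdot 9c$ for the depthwise branches, $4c$ for scale fusion, and $c^2$ for channel fusion.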