Automated Detection of Defects on Metal Surfaces using Vision Transformers
Toqa Alaa, Mostafa Kotb, Arwa Zakaria, Mariam Diab, Walid Gomaa
TL;DR
This paper addresses automated defect detection on metal surfaces with a Vision Transformer (ViT) backbone in a dual-branch architecture that classifies and localizes defects simultaneously. It introduces Multi-DET, a dense and diverse dataset, and combines a ViT encoder with a CNN and shared MLPs, using anchor boxes to handle multiple defects per image. The approach achieves high classification accuracy and precise localization (e.g., 93.5% accuracy, ~3.2 pixel MAE, ~0.72 mean IoU) and shows less overfitting than CNN baselines, which points to practical value for manufacturing quality control. Limitations include handling highly irregular defect shapes and reaching real-time performance; future work targets dataset expansion and optimizing the ViT for speed and robustness.
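As an illustration of the mean IoU figure quoted above, the following is a minimal sketch of how intersection over union between predicted and ground-truth boxes is commonly computed. The box format `(x_min, y_min, x_max, y_max)` and the helper names `box_iou` and `mean_iou` are assumptions made for this example; the paper's own evaluation code and box parameterization may differ.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over already-matched (prediction, target) pairs."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

This sketch assumes each prediction has already been matched to one ground-truth box; with several defects per image, a matching step (for example, by anchor assignment) would come first.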
Abstract
Metal manufacturing often yields defective products, which creates operational challenges. Because traditional manual inspection is time-consuming and resource-intensive, automated solutions are needed. This study uses deep learning to develop a model for detecting metal surface defects with Vision Transformers (ViTs). The proposed model classifies and localizes defects, using a ViT for feature extraction. The architecture then splits into two branches: one for classification and one for localization. The model aims for high classification accuracy while keeping the Mean Squared Error (MSE) and Mean Absolute Error (MAE) of the localization branch as low as possible. Experimental results show that the model can be used for automated defect detection, improving operational efficiency and reducing errors in metal manufacturing.
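The two localization metrics named in the abstract can be sketched as follows. This is a minimal illustration that assumes each box is a 4-tuple of pixel coordinates and that errors are averaged over all coordinates of all boxes; the function name `localization_errors` is hypothetical, and the paper may aggregate the errors differently.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: four pixel coordinates per box.
Box = Tuple[float, float, float, float]


def localization_errors(
    preds: Sequence[Box], targets: Sequence[Box]
) -> Tuple[float, float]:
    """Return (MSE, MAE) averaged over every coordinate of every box."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")

    # Flatten to per-coordinate differences between prediction and target.
    diffs = [p - t for pb, tb in zip(preds, targets) for p, t in zip(pb, tb)]
    if not diffs:
        raise ValueError("at least one box is required")

    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


if __name__ == "__main__":
    # One predicted box versus its ground truth, coordinates in pixels.
    mse, mae = localization_errors([(10, 10, 50, 50)], [(12, 9, 50, 54)])
    print(f"MSE={mse:.2f} px^2, MAE={mae:.2f} px")
```

MSE penalizes large coordinate errors more heavily than MAE, so reporting both gives a sense of the typical error as well as the influence of outliers.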
