SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

Shuxian Zhao; Jie Gui; Baosheng Yu; Lu Dong; Zhipeng Gui

Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these limitations, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information: defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes (shape, size, depth, position, and contrast), enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, zero-/few-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at https://github.com/Zhaosxian/SteelDefectX.
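
To make the two annotation levels concrete, here is a minimal sketch of how one annotated sample might be represented and rendered as text at each level of detail. The `DefectAnnotation` class, its field names, and the prompt wording are all hypothetical; they are not the released SteelDefectX schema, only an illustration of the coarse (class-level) and fine (sample-level) split described above.

```python
from dataclasses import dataclass, field

# The five sample-level attributes named in the abstract.
FINE_DIMENSIONS = ("shape", "size", "depth", "position", "contrast")


@dataclass
class DefectAnnotation:
    """Hypothetical record for one image (not the released schema)."""

    image_path: str
    category: str            # coarse: defect class name
    visual_attributes: str   # coarse: representative visual attributes
    industrial_cause: str    # coarse: associated industrial cause
    fine: dict = field(default_factory=dict)  # fine: sample-specific attributes

    def class_name_prompt(self) -> str:
        # Label-only baseline: nothing but the class name.
        return f"a steel surface defect: {self.category}."

    def coarse_prompt(self) -> str:
        # Class-level description shared by every sample of the category.
        return (
            f"{self.category}: {self.visual_attributes}, "
            f"typically caused by {self.industrial_cause}."
        )

    def fine_prompt(self) -> str:
        # Sample-level description; dimensions without a value are skipped.
        parts = [
            f"{dim}: {self.fine[dim]}"
            for dim in FINE_DIMENSIONS
            if dim in self.fine
        ]
        return f"{self.category} defect with " + "; ".join(parts) + "."
```

A record built with placeholder values (for example `category="inclusion"` and `fine={"shape": "elongated", "position": "upper left"}`) yields three progressively richer strings, which mirrors the class-name, coarse, and fine description levels that the benchmark compares.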

Paper Structure (15 sections, 2 equations, 7 figures, 5 tables)

Introduction
Related Work
Steel Surface Defect Datasets
Vision-Language Models
Dataset
Dataset Construction
Dataset Statistics
Applications and Limitations
Benchmark and Experiments
Vision-Only Classification (Task 1)
Vision-Language Classification (Task 2)
Zero-/Few-Shot Recognition (Task 3)
Zero-shot Transfer (Task 4)
Discussion
Conclusion

Figures (7)

Figure 1: Illustration of different textual descriptions for steel surface defects. The figure shows the progression from simple class-name templates to coarse class-level descriptions that capture the semantic characteristics of defect types, representative visual patterns, and potential causes, and finally to fine-grained sample-level descriptions that provide detailed visual and semantic information.
Figure 2: Illustration of coarse-to-fine textual annotations in SteelDefectX. (a) Class-level: Each defect category is described by three semantic components (defect class name, representative visual attributes, and possible industrial causes), providing global contextual semantics. (b) Sample-level: a four-step pipeline. Step 1, Candidate Generation, uses an open-ended prompt $P_a$ to generate diverse descriptions via GPT-4o. Step 2, Candidate Refinement, applies diversity-based filtering and dimension-aware scoring across five semantic aspects (shape, size, depth, position, contrast). Step 3, Candidate Supplement, uses a structured prompt $P_b$ when dimensional coverage is insufficient. Step 4, Manual Correction, provides quality assurance.
Figure 3: Class distribution of SteelDefectX. The dataset exhibits an imbalanced distribution across 25 defect categories, with sample counts following a log-normal trend. The average number of samples per category is 311. Common defects such as inclusion and water spot dominate the dataset, whereas rare defects (e.g., crease and rolled pit) are underrepresented, reflecting real-world variability in steel surface inspection scenarios.
Figure 4: t-SNE visualization of pixel-level features, illustrating intra-class variation and inter-class overlap among defect categories.
Figure 5: (a) Text length distribution of fine-grained descriptions in SteelDefectX. The distribution centers around 55 words with moderate variance, indicating concise yet sufficiently detailed annotations. (b) Vocabulary diversity across samples, measured by counting unique non-stop words using a TF-IDF representation, reflecting the lexical richness and variation within the dataset.
...and 2 more figures
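
The refinement and supplement steps in the Figure 2 caption (diversity-based filtering, dimension-aware scoring, and a fallback when coverage is insufficient) can be sketched roughly as follows. This is not the authors' implementation: the keyword lists, the Jaccard-similarity filter, the `max_similarity` and `min_coverage` thresholds, and all function names are assumptions chosen only to illustrate the idea.

```python
import re

# Hypothetical keyword lists for the five semantic aspects in Figure 2.
DIMENSION_KEYWORDS = {
    "shape": {"linear", "circular", "irregular", "elongated", "round"},
    "size": {"small", "large", "tiny", "wide", "narrow"},
    "depth": {"shallow", "deep", "superficial"},
    "position": {"center", "edge", "left", "right", "upper", "lower"},
    "contrast": {"dark", "bright", "faint", "contrast"},
}


def tokens(text):
    """Lowercased set of alphabetic words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def covered_dimensions(text):
    """Dimensions for which at least one keyword appears in the text."""
    words = tokens(text)
    return {dim for dim, kws in DIMENSION_KEYWORDS.items() if words & kws}


def jaccard(a, b):
    """Token-set Jaccard similarity between two descriptions."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def filter_diverse(candidates, max_similarity=0.8):
    """Assumed diversity filter: drop near-duplicates of kept candidates."""
    kept = []
    for cand in candidates:
        if all(jaccard(cand, k) <= max_similarity for k in kept):
            kept.append(cand)
    return kept


def refine(candidates, min_coverage=4):
    """Return (best_candidate, missing_dimensions, needs_supplement)."""
    if not candidates:
        raise ValueError("at least one candidate description is required")
    diverse = filter_diverse(candidates)
    # Dimension-aware score: number of the five aspects a candidate covers.
    best = max(diverse, key=lambda c: len(covered_dimensions(c)))
    missing = set(DIMENSION_KEYWORDS) - covered_dimensions(best)
    covered = len(DIMENSION_KEYWORDS) - len(missing)
    return best, missing, covered < min_coverage
```

When `needs_supplement` is true, the `missing` set indicates which aspects a structured follow-up prompt (the role of $P_b$ in the caption) would need to ask about before the description goes to manual correction.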
