Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset

Chih-Chung Hsu, Chia-Ming Lee, Chun-Hung Sun, Kuang-Ming Wu

TL;DR

This work tackles defect classification under data-scarce and low-visual-information conditions by leveraging zero-shot VLM-LLM prompting to extract external-modality cues from images and associated descriptions. It introduces the ASE dataset, rich with image and textual/numeric context, and proposes a three-pronged framework: (1) prompting VLM-LLM to obtain cross-modal descriptors, (2) Progressive Feature Alignment to iteratively align image-text representations with a contrastive objective, and (3) Cross-modality Attention Fusion to adaptively fuse modalities for robust classification. A Task-specific Data Augmentation strategy further enlarges the training signal without fine-tuning, improving tail-class performance. Empirical results on ASE demonstrate superior macro-F1 scores compared with strong baselines and several few-shot methods, highlighting the practical potential of multimodal augmentation for industrial defect recognition in data-constrained settings.

Abstract

Traditional defect classification approaches face two barriers. (1) Insufficient training data and unstable data quality. Collecting sufficient defective samples is expensive and time-consuming, which leads to dataset variance and makes recognition and learning difficult. (2) Over-dependence on the visual modality. When image patterns and textures are monotonic across all defect classes in a dataset, the performance of a conventional automated optical inspection (AOI) system cannot be guaranteed. The same holds for deep models when image quality is compromised by mechanical failures or when defect information is inherently difficult to discern. The main question is: how can these two problems be solved when they occur at the same time? A feasible strategy is to explore other features within the dataset and combine them with a vision-language model (VLM) and a large language model (LLM), exploiting their zero-shot capability. In this work, we first propose the ASE dataset for defect classification, which includes rich data descriptions recorded on each image but whose defect features are hard to learn directly. Second, we present prompting for VLM-LLM on the proposed ASE dataset to activate extra-modality features from images and enhance performance. Third, we design a novel progressive feature alignment (PFA) block that refines image-text features to ease alignment in the few-shot scenario. Finally, the proposed cross-modality attention fusion (CMAF) module effectively fuses features from different modalities. Experimental results demonstrate our method's effectiveness over several defect classification methods on the ASE dataset.

Paper Structure

This paper contains 17 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: A brief illustration of the proposed ASE data. It contains two parts: (1) an AOI image, formed by a large number of pink dots; (2) the recorded numeric and textual information corresponding to the pink dots.
  • Figure 2: A brief summary of the different classes within the ASE dataset. The dense pink dots in each image form a distinctive pattern. A VLM can describe this pattern easily, and the description can be combined with an LLM, through suitable prompt engineering and prior knowledge of the ASE dataset, to enhance defect classification performance without any expensive fine-tuning of the VLM or LLM.
  • Figure 3: The overall architecture of the proposed framework. It incorporates VLM-LLM to explore external-modality features and jointly learn better representations for defect classification. Through the proposed Progressive Feature Alignment (PFA) and Cross-Modality Attention Fusion (CMAF) modules, textual and visual features are efficiently fused, addressing the limitations that conventional deep learning approaches (e.g., CNN, ViT) commonly encounter when processing the ASE dataset.
  • Figure 4: GradCAM++ [8354201] visualization for the proposed ASE dataset. Left column: AOI images from the ASE dataset; middle column: using ResNet50 [resnet]; right column: using DeiT [touvron2020training].
  • Figure 5: The encoded feature vectors in a low-dimensional space. Our PFA considers the self-similarity within every image-text pair across the whole training dataset. After ranking by similarity, data pairs with low self-similarity are regarded as negative samples and added to ${D}_{train}$ with priority [robinson2021contrastive]. The aim is to align negative samples early, easing convergence during alignment.
  • ...and 3 more figures
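
The ranking step described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the equal-sized cumulative stages, and the names `cosine`, `progressive_schedule`, and `num_stages` are all assumptions made for illustration, since this excerpt does not specify how PFA measures self-similarity or schedules the pairs. The sketch scores each image-text pair, sorts pairs from least to most similar, and builds growing training subsets so that the low-similarity pairs enter first.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def progressive_schedule(image_feats, text_feats, num_stages):
    """Rank image-text pairs by self-similarity and build cumulative stages.

    Hypothetical helper: returns the per-pair similarities and a list of
    index lists, where stage s contains the (s + 1) * stage_size pairs with
    the lowest self-similarity, so hard pairs are added with priority.
    """
    sims = [cosine(i, t) for i, t in zip(image_feats, text_feats)]
    # Ascending order: the least similar (hardest to align) pairs come first.
    order = sorted(range(len(sims)), key=lambda k: sims[k])
    stage_size = math.ceil(len(order) / num_stages)
    stages = [order[: (s + 1) * stage_size] for s in range(num_stages)]
    return sims, stages


# Toy 2-D embeddings standing in for encoded image and text features.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
text_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

sims, stages = progressive_schedule(image_feats, text_feats, num_stages=2)
print(stages)  # [[3, 1], [3, 1, 2, 0]]
```

In this toy run, pairs 3 and 1 have the lowest self-similarity and therefore make up the first stage, while the well-aligned pairs 2 and 0 are only added in the second stage.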