OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System

Chih-Chung Hsu; Chia-Ming Lee; Chun-Hung Sun; Kuang-Ming Wu

TL;DR

This paper addresses defect detection in Automatic Optical Inspection (AOI) under data scarcity and cross-domain variability. It proposes OCR-AOI-Net (OANet), a framework that mines OCR-derived external modality features from AOI images and aligns them with image representations through pre-fusion feature alignment, refinement, and a gate-based inference strategy to achieve robust fusion without requiring multi-modal paired data. The approach introduces a single-modality-aware multimodal learning paradigm that leverages in-image textual cues, along with a pre-fusion alignment and refinement mechanism and a gate function to maintain performance under OCR disturbances. Empirical results on an ASE Corporation AOI dataset show improved recall and robustness, suggesting practical benefits for industrial defect detection and reduction of false negatives in manufacturing QA.

Abstract

Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.
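The alignment and gating steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`gated_fusion`, `init_linear`), the linear projection used for alignment, the sigmoid gate, and the random stand-in weights are all assumptions made for illustration, and the feature refinement stage is omitted for brevity.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_linear(n_out, n_in, rng):
    """Small random weights; a stand-in for learned parameters (assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def gated_fusion(image_feat, ocr_feat, params):
    """Fuse an image feature with an OCR-derived feature (hypothetical sketch).

    1. Alignment: project the OCR feature into the image feature space.
    2. Gate: compute a per-dimension weight in (0, 1) from both features.
    3. Fusion: convex combination of the aligned OCR and image features.
    """
    aligned = linear(ocr_feat, *params["align"])
    joint = image_feat + aligned  # list concatenation
    gate = [sigmoid(z) for z in linear(joint, *params["gate"])]
    fused = [g * a + (1.0 - g) * i
             for g, a, i in zip(gate, aligned, image_feat)]
    return fused, gate


rng = random.Random(0)
params = {
    "align": init_linear(4, 3, rng),  # OCR dim 3 -> image dim 4
    "gate": init_linear(4, 8, rng),   # concatenated dim 8 -> gate dim 4
}
image_feat = [0.5, -1.0, 2.0, 0.0]
ocr_feat = [1.0, 0.2, -0.3]
fused, gate = gated_fusion(image_feat, ocr_feat, params)
```

In this sketch, a gate value near zero makes the fused feature fall back to the image feature, which is one plausible way a gate could keep predictions stable when the OCR branch is disturbed.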

Paper Structure (21 sections, 3 figures, 2 tables)

Introduction
Related Work
Advanced Defect Detection
Multi-Modality Learning
Optical Character Recognition
Methodology
Overview
External Modality-guided Data Mining
Modality Feature Retrieval
External Modality Feature Intra-Bagging
Multi-Modal Feature Fusion
Image-based Encoder
Feature Alignment
Feature Refinement
Inference and Decision Making Procedure
...and 6 more sections

Figures (3)

Figure 1: The overall architecture of the proposed framework. It aims to explore external modality features to comprehensively support learning better semantic representations and decision-making. Several single modality features are extracted and fed into corresponding models, followed by feature alignment and refinement to enrich the feature space. Finally, we implement a gate function to aggregate all features and adaptively adjust the weights of different models or features, resulting in more robust and stable prediction results.
Figure 2: Illustration of feature alignment, refinement, and the gate function. External single-modality data is fed into a trainable encoder branch to extract features, while image features are processed in a separate branch. After the features extracted by the different modality models are aligned, they are concatenated with the output of the previous stage. Feature refinement then enhances these features to yield a richer representation. Finally, the gate function is applied to improve overall model performance on the defect detection task.
Figure 3: The left image is a sample provided by ASE Corporation. Text or numerical data can be extracted from such images to provide an additional view of their content. In the middle image, a grid mask perturbs the areas containing the target object, simulating image corruption. In the right image, the text portion is perturbed to simulate a situation where OCR may fail.
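The perturbations described for Figure 3 can be sketched as a grid mask applied to a chosen region, either the target object or the text area. The helper below is hypothetical: the function name `apply_grid_mask`, the checkerboard layout, the cell size, and the fill value are assumptions, and the exact masking pattern used in the paper may differ.

```python
def apply_grid_mask(image, box, cell=4, fill=0):
    """Return a copy of `image` with a checkerboard mask inside `box`.

    image: list of rows of pixel values (grayscale, for simplicity).
    box:   (top, left, bottom, right), half-open, in pixel coordinates.
    cell:  side length of each masked square (assumed value).
    fill:  value written into masked pixels (assumed value).
    """
    top, left, bottom, right = box
    out = [row[:] for row in image]  # leave the input untouched
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            # Mask every other cell-by-cell block, measured from the box corner.
            if ((r - top) // cell + (c - left) // cell) % 2 == 0:
                out[r][c] = fill
    return out


# Example: corrupt the left half of an 8x8 image, leaving the right half intact.
image = [[255] * 8 for _ in range(8)]
masked = apply_grid_mask(image, (0, 0, 8, 4), cell=2)
```

Passing the bounding box of the text region instead of the object region would emulate the OCR-failure case shown in the right image.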
