Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis

Sakhinana Sagar Srinivas; Geethan Sannidhi; Venkataramana Runkana

TL;DR

This study presents a secure, cost-effective, and customizable approach for analyzing microscopy images, addressing the challenges of adopting proprietary models in semiconductor manufacturing.

Abstract

We present a novel framework for analyzing and interpreting electron microscopy images in semiconductor manufacturing using vision-language instruction tuning. The framework takes a teacher-student approach: a pre-trained multimodal large language model such as GPT-4 generates instruction-following data for zero-shot visual question answering (VQA) and classification tasks, and this data is used to customize smaller multimodal models (SMMs) for microscopy image analysis. The result is an instruction-tuned language-and-vision assistant. By merging knowledge engineering with machine learning, the framework transfers domain-specific expertise from larger to smaller multimodal models within this specialized field, greatly reducing the need for extensive human labeling. The approach is secure, cost-effective, and customizable, and it addresses the challenges of adopting proprietary models in semiconductor manufacturing.
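The teacher-student data-generation step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the question templates, the record fields (`image`, `instruction`, `response`), and the `stub_teacher` function, which stands in for a large multimodal model and makes no real API call. The paper's actual prompts and data format are not given in this text.

```python
import json
from typing import Callable, Dict, List

# Hypothetical question templates; the prompts used in the paper are not shown here.
QUESTION_TEMPLATES = [
    "What nanomaterial category does this micrograph show?",
    "Describe the surface morphology visible in this image.",
]


def build_instruction_data(
    image_paths: List[str],
    teacher: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher every template question about every image.

    Returns one record per (image, question) pair, suitable for
    instruction-tuning a smaller student model.
    """
    records = []
    for path in image_paths:
        for question in QUESTION_TEMPLATES:
            answer = teacher(path, question)
            records.append(
                {"image": path, "instruction": question, "response": answer}
            )
    return records


def stub_teacher(image_path: str, question: str) -> str:
    """Stand-in for a large multimodal teacher model (assumption)."""
    return f"[teacher answer for {image_path}]"


if __name__ == "__main__":
    data = build_instruction_data(["sem_0001.png"], stub_teacher)
    print(json.dumps(data, indent=2))
```

Passing the teacher as a callable keeps the sketch independent of any particular model provider; in practice the stub would be replaced by a call to whichever teacher model is available.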

Paper Structure (12 sections, 3 equations, 3 figures, 7 tables)

Introduction
Proposed Method
Instruction-tuned teacher LMM:
Multimodal Instruction-Following Data:
Model Architecture:
Experiments And Results
Datasets:
Experimental Setup:
VQA Results:
Image Classification Results:
Ablation Study:
Conclusion

Figures (3)

Figure 1: Architecture and training objectives of MVaEMa, our proposed multimodal deep learning framework for the VQA task in nanomaterial image analysis. The schematic shows a small-scale multimodal architecture that integrates text and image data and is trained with vision-language instruction tuning on instruction-following data generated by the instruction-tuned GPT-4 Turbo with Vision. The architecture consists of an image encoder, a text encoder, and an image-grounded text encoder and text decoder, each containing self-attention and feed-forward layers. The framework is optimized with a combination of image-text contrastive, binary cross-entropy, and language modeling losses, which align the multimodal representations so that the model generates text answering questions about the image, illustrating how the framework interprets and articulates complex intermodal relationships.
Figure 2: Challenges of the VQA task on electron micrographs from the SEM dataset (Aversa et al., 2018).
Figure 3: Nanomaterial categories in the SEM dataset. Rows one to three, from left to right: biological, fibers, films, MEMS; nanowires, particles, patterned surface, porous sponges; and powder, tips.
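The combined objective named in the Figure 1 caption (image-text contrastive, binary cross-entropy, and language modeling losses) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the equal weighting of the three terms, the temperature of 0.07, the reading of the binary cross-entropy term as an image-text matching loss, and all function names are assumptions.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(sim: List[List[float]], temperature: float = 0.07) -> float:
    """Symmetric image-text contrastive loss.

    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal. The temperature value is an assumption.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    image_to_text = sum(cross_entropy(scaled[i], i) for i in range(n)) / n
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def binary_cross_entropy(prob: float, label: int) -> float:
    """Binary cross-entropy for one predicted match probability."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def language_modeling_loss(
    token_logits: List[List[float]], targets: List[int]
) -> float:
    """Mean next-token cross-entropy over the answer tokens."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, targets)]
    return sum(losses) / len(losses)


def total_loss(sim, match_probs, match_labels, token_logits, targets) -> float:
    """Sum of the three terms; equal weighting is an assumption."""
    itc = contrastive_loss(sim)
    bce = sum(
        binary_cross_entropy(p, y) for p, y in zip(match_probs, match_labels)
    ) / len(match_labels)
    lm = language_modeling_loss(token_logits, targets)
    return itc + bce + lm
```

For example, `total_loss([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.2], [1, 0], [[2.0, 0.1], [0.3, 1.5]], [0, 1])` evaluates the objective for a batch of two aligned image-text pairs with a two-token answer. A real training loop would compute these terms on model outputs with an automatic-differentiation library.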
