Foundational Model for Electron Micrograph Analysis: Instruction-Tuning Small-Scale Language-and-Vision Assistant for Enterprise Adoption

Sakhinana Sagar Srinivas; Chidaksh Ravuru; Geethan Sannidhi; Venkataramana Runkana

TL;DR

The paper tackles the challenge of interpreting high-resolution electron micrographs in semiconductor manufacturing where labeled data are scarce. It introduces MAEMI, a small-scale vision-language assistant trained through instruction tuning on synthetic data generated by large multimodal models and distilled from large models to compact, open-source backbones, enabling on-premises deployment with strong privacy. Key technical contributions include dynamic low-rank adapters (DyQLoRA-FA), weight-only quantization (WOQ), and a data-generation pipeline using GPT-4 Turbo with Vision to create image-question-answer triplets from the SEM dataset, supporting zero-/few-shot VQA and image captioning. Empirical results show MAEMI achieves superior or competitive performance against baselines on captioning and VQA tasks, generalizes to open-source material datasets, and offers a practical path for enterprise adoption with low-cost hardware and on-site data privacy. Overall, the work advances practical, privacy-preserving multimodal analysis of electron micrographs with potential impact on quality control and process optimization in semiconductor manufacturing.
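To make the weight-only quantization idea mentioned above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the symmetric int8 scheme, the per-row scales, and the function names (`quantize_row`, `woq_linear`) are assumptions chosen for illustration. The point it shows is that only the weights are stored as small integers plus a scale, while the activations stay in floating point.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-row quantization (assumed scheme): integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def woq_linear(x: List[float], q_weight: List[List[int]], scales: List[float]) -> List[float]:
    """Compute y = W x where W is stored as integer codes and per-row scales; x stays float."""
    return [
        scale * sum(c * xi for c, xi in zip(codes, x))
        for codes, scale in zip(q_weight, scales)
    ]


# Toy weight matrix and input (hypothetical values).
weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row(row) for row in weights]
q_weight = [codes for codes, _ in quantized]
scales = [scale for _, scale in quantized]

y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
y_q = woq_linear(x, q_weight, scales)
```

In this toy case `y_q` stays close to the full-precision `y_ref`, because each weight is off by at most half a quantization step. Production schemes typically use lower bit widths and group-wise scales, which this sketch does not attempt.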

Abstract

Semiconductor imaging and analysis are critical yet understudied in deep learning, limiting our ability for precise control and optimization in semiconductor manufacturing. We introduce a small-scale multimodal framework for analyzing semiconductor electron microscopy images (MAEMI) through vision-language instruction tuning. We generate a customized instruction-following dataset using large multimodal models on microscopic image analysis. We perform knowledge transfer from larger to smaller models through knowledge distillation, resulting in improved accuracy of smaller models on visual question answering (VQA) tasks. This approach eliminates the need for expensive, human expert-annotated datasets for microscopic image analysis tasks. Enterprises can further finetune MAEMI on their intellectual data, enhancing privacy and performance on low-cost consumer hardware. Our experiments show that MAEMI outperforms traditional methods, adapts to data distribution shifts, and supports high-throughput screening.

Paper Structure (25 sections, 6 equations, 11 figures, 15 tables)

Introduction
Experiments And Results
Datasets
Experimental Studies
Results
Conclusion
Technical Appendix
Dynamic Low-Rank Adaptation with Activation Memory Reduction (DyQLoRA-FA)
Fine-Tuning, Pretrained Large Language Models (LLMs)
Pretrained Large Multimodal Models
Multimodal Instruction-Following Data
Vision Encoder
Sampling Strategies for Instruction Tuning Dataset Generation
Additional Information
Experimental Setup
...and 10 more sections

Figures (11)

Figure 1: Challenges in analyzing electron micrographs from the SEM dataset (Aversa et al., 2018).
Figure 2: The schematic illustrates MAEMI, a small-scale, autoregressive text generation model. It takes as input a multimodal prompt consisting of the target image interleaved with auxiliary image descriptions and captioning instructions (or end-user questions), and outputs visually grounded descriptive text in a zero-shot setting. MAEMI utilizes a vision transformer and a pre-trained language model to analyze images and interpret questions about them. Both encoders synergize through a multi-layer structure of alternating gated cross-attention and self-attention blocks, effectively integrating both modalities – visual and textual information – to generate accurate and contextually relevant answers. The framework is trained in a supervised learning setting using a vision-language instruction tuning dataset to generate answers that are grounded in visual information and aligned with the target image content.
Figure 3: Representative microscopic images of diverse nanomaterials: biological structures, fibers, films, MEMS devices, nanowires (top); nanoparticles, patterned surfaces, porous sponges, powders, tips (bottom). (Source: Aversa et al., 2018)
Figure 4: The schematic illustrates the small-scale, multimodal assistant for electron micrograph analysis (MAEMI), a content-aware, visually-conditioned, autoregressive text generation model that takes a multimodal prompt containing electron micrographs interleaved with textual descriptions, and produces free-form text as output. The input consists of a target image, user-provided supplementary text, and task-specific instruction. The goal is to categorize the image into one of ten categories in a zero-shot setting.
Figure 5: The schematic illustrates a small-scale, multimodal assistant for electron micrograph analysis (MAEMI), a visually-conditioned, autoregressive text generation model. The model takes a multimodal input consisting of microscopic images arbitrarily interleaved with textual descriptions and produces free-form text as output. The input includes a few demonstration examples as input-output mappings (microscopic images and their corresponding labels) and a task-specific instruction. The goal is to predict the label for the target image in a few-shot setting.
...and 6 more figures
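The few-shot setup described in the Figure 5 caption (demonstration image-label pairs, a task-specific instruction, and a target image) can be sketched as a simple prompt builder. This is an illustrative assumption, not the paper's code: the `<image:...>` placeholder syntax, the `Demo` class, the file names, and `build_few_shot_prompt` are hypothetical, and the category list is taken from the Figure 3 caption.

```python
from dataclasses import dataclass
from typing import List

# The ten nanomaterial categories named in the Figure 3 caption.
CATEGORIES = [
    "biological structures", "fibers", "films", "MEMS devices", "nanowires",
    "nanoparticles", "patterned surfaces", "porous sponges", "powders", "tips",
]


@dataclass
class Demo:
    """One demonstration: a micrograph (referenced by path) and its label."""
    image_path: str
    label: str


def build_few_shot_prompt(demos: List[Demo], target_image: str, instruction: str) -> str:
    """Interleave the instruction, demonstration pairs, and the unlabeled target image."""
    lines = [instruction, "Categories: " + ", ".join(CATEGORIES)]
    for demo in demos:
        if demo.label not in CATEGORIES:
            raise ValueError(f"unknown label: {demo.label}")
        # Hypothetical image placeholder; a real model would use its own image tokens.
        lines.append(f"<image:{demo.image_path}> Label: {demo.label}")
    # The target image ends the prompt with an open label slot for the model to fill.
    lines.append(f"<image:{target_image}> Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    demos=[Demo("fibers_01.png", "fibers"), Demo("powder_07.png", "powders")],
    target_image="unknown_42.png",
    instruction="Classify the electron micrograph into one of the categories.",
)
```

Passing an empty `demos` list yields the zero-shot form of the same prompt, which matches the setting described in the Figure 4 caption.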
