Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis

Sakhinana Sagar Srinivas; Chidaksh Ravuru; Geethan Sannidhi; Venkataramana Runkana

TL;DR

This work presents sLAVA, a small-scale vision-language assistant tailored for semiconductor electron micrograph analysis. It is built via a teacher-student paradigm in which GPT-4 generates instruction-following data for on-premise fine-tuning on consumer hardware. Central to the approach is Dynamic Adaptation of MoPEs (DyA-MoPEs), which combines quantized Mixture-of-Experts with parameter-efficient fine-tuning to enable memory-conscious adaptation of Llama-2-7B for VQA, image captioning, and zero-/few-shot classification in microscopy domains. The authors report that sLAVA achieves state-of-the-art or competitive performance against larger proprietary models while preserving data privacy. They validate robustness across diverse datasets (including the SEM dataset aversa2018first, NEU-SDD, CMI, and KTH-TIPS) and through ablations on loss functions, data generation, and sampling strategies. In practice, this lets enterprises deploy domain-specific, privacy-preserving vision-language tools for micrograph analysis within their own infrastructure, reducing reliance on external LMMs.

Abstract

Semiconductors, crucial to modern electronics, are generally under-researched in foundational models. This gap highlights the need for research to enhance the semiconductor device technology portfolio and aid in high-end device fabrication. In this paper, we introduce sLAVA, a small-scale vision-language assistant tailored for semiconductor manufacturing, with a focus on electron microscopy image analysis. It addresses the challenges of data scarcity and of acquiring high-quality, expert-annotated data. We employ a teacher-student paradigm, using a foundational vision-language model such as GPT-4 as a teacher to create instruction-following multimodal data for customizing the student model, sLAVA, for electron microscopic image analysis tasks on consumer hardware with limited budgets. Our approach allows enterprises to further fine-tune the proposed framework with their proprietary data securely within their own infrastructure, protecting intellectual property. Rigorous experiments validate that our framework surpasses traditional methods, handles data shifts, and enables high-throughput screening.

Paper Structure (28 sections, 5 equations, 11 figures, 15 tables)

Introduction
Experiments And Results
Datasets
Experimental Studies
Results
Conclusion
Technical Appendix
Dynamic Adaptation of Mixture of Quantized Parameter-Efficient Experts (DyA-MoQPEs)
Fine-Tuning, Pretrained Large Language Models (LLMs)
Generation of MultiModal Instruction-Tuning Data
Sampling Strategies for Instruction Tuning Dataset Generation
Loss Functions
Image-Text Matching loss (ITM)
Language modeling loss (LM)
Additional Information
...and 13 more sections

Figures (11)

Figure 1: Challenges in Visual Question Answering (VQA) task on electron micrographs from the SEM dataset aversa2018first.
Figure 2: The schematic illustrates a variant of sLAVA, a small-scale, visually conditioned, autoregressive text generation model that takes prompts combining visual and textual information as input and outputs free-form text for the image captioning task. The input multimodal prompt includes a microscopic image combined with a supplementary user-provided image description, along with the end-user's question. In this zero-shot setting, the task is to answer the question about the microscopic image solely based on the small-scale model's internal parametric knowledge. sLAVA comprises a vision encoder to capture the global context of microscopic images, and a text encoder that interprets end-user questions and the auxiliary user-provided image information. The image-grounded text encoder facilitates cross-modal learning by integrating visual information directly into text understanding, thereby generating a comprehensive multimodal representation grounded in the image's visual content. The image-grounded text decoder then synthesizes coherent and contextually relevant textual outputs based on the generated multimodal representations. Finally, the framework is jointly optimized using the binary cross-entropy loss for positive image-text matching and language modeling loss for contextually relevant text generation to answer end-user questions.
Figure 3: SEM images from aversa2018first showcasing diverse nanomaterial morphologies. Top row: biological structures, fibers, films, MEMS devices, nanowires. Bottom row: nanoparticles, patterned surfaces, porous sponges, powders, tips.
Figure 4: The schematic depicts a variant of sLAVA (small-scale language-and-vision assistant), a family of visually conditioned, autoregressive text generation models. The small-scale vision-and-language model takes as input a multimodal prompt consisting of the target electron micrograph and user-provided auxiliary text, along with the user question. The model then generates free-form text to answer end-user questions. The task is to categorize the image into one of ten categories, such as biological structures, fibers, and films, in a zero-shot setting.
Figure 5: The schematic depicts a variant of sLAVA, a small-scale language-and-vision assistant. It takes a multimodal prompt consisting of electron micrographs, interspersed arbitrarily with text, as input and generates free-form text as output. The input consists of a series of electron microscopy images, their corresponding ground-truth labels, and a task-specific instruction. In a few-shot setting, the objective is to predict the label for the target image.
...and 6 more figures
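
The joint objective mentioned in the Figure 2 caption (binary cross-entropy for image-text matching plus a language modeling loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-sample scalar inputs, and the weights `itm_weight` and `lm_weight` are hypothetical, and the equal default weighting is an assumption, since the caption only states that the two losses are optimized jointly.

```python
import math


def binary_cross_entropy(match_prob, is_match, eps=1e-12):
    """BCE for one image-text pair.

    match_prob: predicted probability that the image and text match.
    is_match:   1 for a positive pair, 0 for a negative pair.
    """
    p = min(max(match_prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(is_match * math.log(p) + (1 - is_match) * math.log(1.0 - p))


def language_modeling_loss(token_probs, eps=1e-12):
    """Mean negative log-likelihood of the ground-truth tokens.

    token_probs: probability the decoder assigned to each correct token.
    """
    return -sum(math.log(max(p, eps)) for p in token_probs) / len(token_probs)


def joint_loss(match_prob, is_match, token_probs, itm_weight=1.0, lm_weight=1.0):
    """Weighted sum of the ITM and LM terms (the weights are an assumption)."""
    itm = binary_cross_entropy(match_prob, is_match)
    lm = language_modeling_loss(token_probs)
    return itm_weight * itm + lm_weight * lm


if __name__ == "__main__":
    # A positive pair scored at 0.9, with three answer tokens.
    print(joint_loss(0.9, 1, [0.8, 0.6, 0.95]))
```

In a real training loop both terms would be computed over batches of logits by a deep learning framework; the scalar version here only shows how the two terms combine.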
