Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis
Sakhinana Sagar Srinivas, Chidaksh Ravuru, Geethan Sannidhi, Venkataramana Runkana
TL;DR
This work presents sLAVA, a small-scale vision-language assistant tailored for semiconductor electron micrograph analysis, built via a teacher-student paradigm that uses GPT-4 to generate instruction-following data for on-premise fine-tuning on consumer hardware. Central to the approach is Dynamic Adaptation of MoPEs (DyA-MoPEs), which combines quantized Mixture-of-Experts with parameter-efficient fine-tuning to enable flexible, memory-conscious adaptation of Llama-2-7B for VQA, image captioning, and zero-/few-shot classification in microscopy domains. The authors report that sLAVA achieves state-of-the-art or competitive performance against larger proprietary models while preserving data privacy, and they validate robustness across diverse datasets (including the SEM dataset of Aversa et al. (2018), NEU-SDD, CMI, and KTH-TIPS) and through extensive ablations on loss functions, data generation, and sampling strategies. The practical impact lies in enabling enterprises to deploy domain-specific, privacy-preserving vision-language tools for micrograph analysis, accelerating semiconductor development while reducing reliance on external LMMs. Overall, the work advances parameter-efficient, on-premise vision-language modeling for specialized microscopy tasks, with empirical support across multiple open datasets.
Abstract
Semiconductors, crucial to modern electronics, are generally under-researched in foundational models. This gap highlights the need for research to enhance the semiconductor device technology portfolio and aid in high-end device fabrication. In this paper, we introduce sLAVA, a small-scale vision-language assistant tailored for semiconductor manufacturing, with a focus on electron microscopy image analysis. sLAVA addresses the challenges of data scarcity and of acquiring high-quality, expert-annotated data. We employ a teacher-student paradigm, using a foundational vision-language model such as GPT-4 as a teacher to create instruction-following multimodal data for customizing the student model, sLAVA, for electron microscopy image analysis tasks on consumer hardware with limited budgets. Our approach allows enterprises to further fine-tune the proposed framework on their proprietary data securely within their own infrastructure, protecting intellectual property. Rigorous experiments validate that our framework surpasses traditional methods, handles data shifts, and enables high-throughput screening.
