LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
TL;DR
LEAML addresses the challenge of adapting multimodal large language models to out-of-distribution visual QA tasks with limited annotations by introducing a two-stage, label-efficient workflow. It first trains a QA Generator on scarce labeled data and uses caption distillation from a large MLLM to produce high-quality pseudo QA pairs for abundant unlabeled images, then fine-tunes the VQA model on both real and pseudo data. A key innovation is Selective Neuron Distillation, which updates only QA-relevant neurons based on gradient-based importance, enabling domain-specific knowledge transfer while preserving general generation abilities. Empirically, LEAML yields substantial gains on GI endoscopy and sports VQA benchmarks under 1% labeled supervision, outperforming standard fine-tuning and confirming the value of targeted distillation and pseudo-labeling for domain adaptation in MLLMs.
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is scarce and expensive to obtain. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA Generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision.
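
To make the idea of selectively updating QA-relevant neurons concrete, the following is a minimal sketch on a toy layer, not the authors' implementation. It assumes that each row of a weight matrix is one neuron, that importance is scored as the sum of |weight x gradient| over a neuron's weights (one common gradient-based criterion; the abstract does not specify the exact one), and that a fixed fraction of neurons is kept. The function names, keep_ratio, and all numeric values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # one row per neuron (assumed layout)


def neuron_importance(weights: Matrix, qa_grads: Matrix) -> List[float]:
    """Score each neuron by sum(|w * g|) over its weights.

    qa_grads stands in for gradients of a QA loss on the labeled data.
    This scoring rule is an assumption made for illustration.
    """
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, qa_grads)
    ]


def select_top_neurons(scores: List[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring neurons, in ascending order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def selective_update(
    weights: Matrix, distill_grads: Matrix, selected: List[int], lr: float
) -> Matrix:
    """Apply one gradient step to the selected neurons only.

    Unselected rows are copied unchanged, which is how general abilities
    would be left untouched in this toy setting.
    """
    chosen = set(selected)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in chosen else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, distill_grads))
    ]


# Toy layer with four neurons and two weights each (made-up numbers).
weights = [[1.0, -2.0], [0.1, 0.1], [0.5, 0.5], [3.0, 0.0]]
qa_grads = [[0.5, 0.5], [0.1, 0.1], [0.0, 0.2], [1.0, 1.0]]
# Stand-in for gradients of the caption distillation loss.
distill_grads = [[1.0, 1.0] for _ in weights]

scores = neuron_importance(weights, qa_grads)
selected = select_top_neurons(scores, keep_ratio=0.5)
updated = selective_update(weights, distill_grads, selected, lr=0.1)

print("selected neurons:", selected)
print("updated weights:", updated)
```

In this toy run, neurons 0 and 3 receive the highest scores and are the only rows that change; the other two rows stay exactly as they were. In a real MLLM the same masking would be applied per layer to gradients obtained by backpropagation, and the selection criterion and ratio would follow the paper rather than the values assumed here.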
