LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang

TL;DR

LEAML addresses the challenge of adapting multimodal large language models to out-of-distribution visual QA tasks with limited annotations by introducing a two-stage, label-efficient workflow. It first trains a QA Generator on scarce labeled data and uses caption distillation from a large MLLM to produce high-quality pseudo QA pairs for abundant unlabeled images, then fine-tunes the VQA model on both real and pseudo data. A key innovation is Selective Neuron Distillation, which updates only QA-relevant neurons based on gradient-based importance, enabling domain-specific knowledge transfer while preserving general generation abilities. Empirically, LEAML yields substantial gains on GI endoscopy and sports VQA benchmarks under 1% labeled supervision, outperforming standard fine-tuning and confirming the value of targeted distillation and pseudo-labeling for domain adaptation in MLLMs.
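The data flow of the two-stage workflow can be sketched roughly as follows. This is only an illustration under stated assumptions: `build_finetuning_set`, `toy_generator`, and the `(image_id, question, answer)` tuple layout are hypothetical names chosen here, and the stand-in generator simply returns a placeholder pair where a trained QA Generator would go.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical record layout: (image_id, question, answer).
QAExample = Tuple[str, str, str]


def build_finetuning_set(
    labeled: Iterable[QAExample],
    unlabeled_image_ids: Iterable[str],
    qa_generator: Callable[[str], List[Tuple[str, str]]],
) -> List[QAExample]:
    """Combine real labeled QA pairs with pseudo QA pairs for unlabeled images."""
    # Stage 1 output: one or more pseudo QA pairs per unlabeled image.
    pseudo = [
        (image_id, question, answer)
        for image_id in unlabeled_image_ids
        for question, answer in qa_generator(image_id)
    ]
    # Stage 2 input: the VQA model would be fine-tuned on real + pseudo data.
    return list(labeled) + pseudo


def toy_generator(image_id: str) -> List[Tuple[str, str]]:
    """Stand-in for a trained QA Generator (assumption, not the paper's model)."""
    return [(f"What is shown in {image_id}?", "placeholder answer")]


labeled = [("img_001", "Is a polyp visible?", "yes")]  # toy data
train_set = build_finetuning_set(labeled, ["img_002", "img_003"], toy_generator)
```

In this toy run, `train_set` holds the one labeled example followed by two pseudo examples; how the paper weights or filters pseudo pairs is not stated in this summary and is not modeled here.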

Abstract

Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA Generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision.
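The selective update idea can be illustrated with a minimal sketch. The summary only says that selection is gradient-based, so the specifics below are assumptions: importance is taken to be the absolute gradient of the QA loss, a fixed top fraction is kept, parameters are treated as a flat list of floats, and the update is plain SGD. The function names are hypothetical and are not taken from the paper.

```python
from typing import List


def importance_from_gradients(qa_grads: List[float]) -> List[float]:
    """Assumed importance score: magnitude of the QA-loss gradient."""
    return [abs(g) for g in qa_grads]


def select_top_fraction(scores: List[float], fraction: float) -> List[bool]:
    """Return a mask that is True for the highest-scoring `fraction` of entries."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def masked_sgd_step(
    params: List[float], grads: List[float], mask: List[bool], lr: float
) -> List[float]:
    """Apply an SGD step only where mask is True; other entries stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Toy values for illustration only.
params = [0.5, -1.0, 2.0, 0.1]
qa_grads = [0.9, -0.05, 0.02, -0.7]    # gradients from labeled QA data
caption_grads = [1.0, 1.0, 1.0, 1.0]   # gradients from the caption loss

mask = select_top_fraction(importance_from_gradients(qa_grads), fraction=0.5)
updated = masked_sgd_step(params, caption_grads, mask, lr=0.1)
```

Here the mask is computed once from labeled QA gradients and then reused while training on the caption objective, so the entries judged irrelevant to QA keep their original values. In a real MLLM the same pattern would apply per neuron or per weight tensor rather than to a flat list.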

Paper Structure

This paper contains 23 sections, 6 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of the proposed two-stage LEAML framework for OOD VQA adaptation. In Pseudo QA Generation, the QA Generator is trained using a small set of labeled question-answer pairs and then used to generate pseudo QA pairs for a large collection of unlabeled images. In OOD VQA Finetuning, the VQA model is fine-tuned with both the original labeled data and the pseudo QA pairs produced for the unlabeled data, enabling label-efficient adaptation to out-of-distribution visual question answering. The learning of the QA Generator is detailed in Figure 2.
  • Figure 2: Illustration of our Selective Neuron Distillation for the QA Generator. The QA-relevant parameters are first selected based on gradient scores from labeled QA data. During training, only these selected parameters are updated using auxiliary caption supervision from unlabeled images, allowing QA-related knowledge distillation for the QA Generator.
  • Figure 3: Qualitative results on the Kvasir-VQA dataset.
  • Figure 4: Qualitative results on the SPORTU dataset.
  • Figure 5: Qualitative results on the Kvasir-VQA dataset.
  • ...and 7 more figures