Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Akash Gupta, Amos Storkey, Mirella Lapata

TL;DR

This work proposes a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples.

Abstract

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve monotonically when increasing the number of examples. We hypothesize that this happens because the LMM is overwhelmed by extraneous information in the image embeddings that is irrelevant to the downstream task. To address this, we propose a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples. We facilitate this distillation through an attention-mapper module that can be easily integrated with any LMM architecture and is jointly learned with soft prompts. Evaluation on the VL-ICL Bench shows that our method successfully achieves task adaptation in low-data regimes with just a few gradient steps, outperforming ICL by 21.2%. Comparisons with parameter-efficient finetuning methods demonstrate that meta-learning further enhances this adaptation by 7.7% for various VQA tasks.
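The distillation step described in the abstract can be pictured as a small cross-attention operation in which each soft prompt queries the image embeddings and pools the features it attends to. The sketch below is only an illustration under stated assumptions: the function name `distill_prompts`, the dimensions, and the omission of learned query/key/value projections are all hypothetical simplifications, not the paper's attention-mapper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_prompts(soft_prompts, image_embeddings):
    """Hypothetical attention-mapper: every soft prompt acts as a query over
    the image embeddings and returns an attention-weighted average of them.
    Learned projection matrices are left out to keep the sketch short."""
    dim = len(soft_prompts[0])
    scale = 1.0 / math.sqrt(dim)
    distilled = []
    for prompt in soft_prompts:
        # Scaled dot-product score between this prompt and each image token.
        scores = [
            scale * sum(p * e for p, e in zip(prompt, emb))
            for emb in image_embeddings
        ]
        weights = softmax(scores)
        # Weighted average of the image tokens, one value per feature.
        pooled = [
            sum(w * emb[d] for w, emb in zip(weights, image_embeddings))
            for d in range(dim)
        ]
        distilled.append(pooled)
    return distilled


# Illustrative sizes only (assumed, not taken from the paper).
random.seed(0)
num_prompts, num_patches, dim = 4, 16, 8
soft_prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompts)]
image_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

distilled = distill_prompts(soft_prompts, image_embeddings)
print(len(distilled), len(distilled[0]))
```

The point of the sketch is the shape change: however many image tokens come in, the language model only receives a fixed number of prompt vectors, which is one way to read the abstract's "fixed set of soft prompts".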

Paper Structure

This paper contains 35 sections, 12 equations, 22 figures, 18 tables, 1 algorithm.

Figures (22)

  • Figure 1: Failure case of LLaVA-OneVision-7B [li2025llavaonevision] on an example from the Fast Open-Ended MiniImageNet classification task [NEURIPS2021_01b7575c]. When no in-context examples are provided (0-shot), the model generates a generic description of the image. As more examples (shots) are added, it begins to learn the answer format (a single word) but still fails to grasp the task, producing incorrect or irrelevant predictions. For brevity, we show the in-context examples (left) only for the 2-way 1-shot setting, but provide model predictions (in red) for up to 5 shots.
  • Figure 2: I2T and T2T performance with LLaVA-OneVision-7B on Operator Induction and CLEVR Count Induction tasks.
  • Figure 3: Our proposed MAPD framework based on LLaVA v1.5-7B [Liu_2024_CVPR]: image embeddings are distilled into soft prompts $P$ during instruction finetuning. The support set $(X_v^{\text{supp}}, X_q^{\text{supp}}, X_a^{\text{supp}})$ is processed first to obtain the loss $L_{\text{supp}}$, which is used in the inner loop to obtain the task-specific parameters $\{\theta', P'\}$. Next, the query set $(X_v^{\text{query}}, X_q^{\text{query}}, X_a^{\text{query}})$ is used to calculate the query loss for the outer-loop optimization of the meta-parameters $\{\theta, P\}$.
  • Figure 4: (a) Left: Performance comparison between MAPD+FT (M) and In-ContextPD+ICL (I). Mean accuracy is computed across all VL-ICL datasets. We consider prompt token lengths $P=\{4,16,64,256\}$, shown on a $\log_2(P)$ scale, for different numbers of shots. (b) Right: Performance of different prompt distillation methods on three Operator Induction subtasks: Task Induction, Perception, and Mathematical Reasoning. We report mean exact-match (EM; $\%$) for 1, 2, and 8 shots as defined in the VL-ICL Bench [zong2025vlicl], except for Mathematical Reasoning, which uses mean ratings generated by Qwen-2.5-VL-32B-Instruct. More details can be found in Appendix \ref{sec_app:op_abls}.
  • Figure 5: Projection layer architectures in the base LMM. SP: Soft Prompts, ATT: Attention-Mapper, MLP: 2-layer MLP (originally used in LLaVA v1.5).
  • ...and 17 more figures
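
The inner-loop/outer-loop procedure in the Figure 3 caption can be sketched on a toy problem. The example below is a hedged illustration, not the paper's training code: a scalar linear model stands in for the LMM, `theta` and `prompt` stand in for $\{\theta, P\}$, the task sampler and learning rates are made up, and the outer update uses a first-order approximation (the query gradient at the adapted parameters is applied directly to the meta-parameters), which is an assumption made here for brevity.

```python
import random


def loss_and_grads(theta, prompt, batch):
    """Mean squared error of y_hat = theta * x + prompt, plus its gradients."""
    n = len(batch)
    loss = g_theta = g_prompt = 0.0
    for x, y in batch:
        err = theta * x + prompt - y
        loss += err * err / n
        g_theta += 2.0 * err * x / n
        g_prompt += 2.0 * err / n
    return loss, g_theta, g_prompt


def inner_adapt(theta, prompt, support, lr=0.1, steps=1):
    """Inner loop: a few gradient steps on the support loss give the
    task-specific parameters (theta', P')."""
    for _ in range(steps):
        _, g_theta, g_prompt = loss_and_grads(theta, prompt, support)
        theta -= lr * g_theta
        prompt -= lr * g_prompt
    return theta, prompt


def sample_task(rng, shots=5):
    """Hypothetical task family: y = slope * x + offset, with a support set
    and a query set drawn from the same task."""
    slope = rng.uniform(1.5, 2.5)
    offset = rng.uniform(-1.0, 1.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(shots)]
        return [(x, slope * x + offset) for x in xs]

    return draw(), draw()


rng = random.Random(0)
theta, prompt = 0.0, 0.0   # meta-parameters {theta, P}
meta_lr = 0.05

for _ in range(500):
    support, query = sample_task(rng)
    # Inner loop on the support set.
    theta_task, prompt_task = inner_adapt(theta, prompt, support)
    # Outer loop: query loss evaluated at the adapted parameters; its
    # gradient is applied to the meta-parameters (first-order approximation).
    _, g_theta, g_prompt = loss_and_grads(theta_task, prompt_task, query)
    theta -= meta_lr * g_theta
    prompt -= meta_lr * g_prompt

print(f"meta-learned theta={theta:.3f}, prompt={prompt:.3f}")
```

At test time only `inner_adapt` would be run on the few available examples, mirroring the abstract's "just a few gradient steps"; the meta-training loop exists to find an initialization from which those few steps are effective.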