Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Akash Gupta; Amos Storkey; Mirella Lapata

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Akash Gupta, Amos Storkey, Mirella Lapata

TL;DR

This work proposes a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples.

Abstract

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve monotonically when increasing the number of examples. We hypothesize that this happens because the LMM is overwhelmed by extraneous information in the image embeddings that is irrelevant to the downstream task. To address this, we propose a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples. We facilitate this distillation through an attention-mapper module that can be easily integrated with any LMM architecture and is jointly learned with soft prompts. Evaluation on the VL-ICL Bench shows that our method successfully achieves task adaptation in low-data regimes with just a few gradient steps, outperforming ICL by 21.2%. Comparisons with parameter-efficient finetuning methods demonstrate that meta-learning further enhances this adaptation by 7.7% for various VQA tasks.

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

TL;DR

Abstract

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (22)