Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

TL;DR

The paper tackles multimodal few-shot learning by bridging vision and language through a meta-learning framework in which a lightweight meta-mapper generates a visual prefix for a frozen language model. By training across sequences of multimodal tasks, the model accrues shared meta-knowledge that enables rapid adaptation with a few gradient steps, without hand-engineered task induction. Experiments on cross-domain and in-domain benchmarks show that the approach outperforms Frozen baselines while using far fewer trainable parameters and computational resources. The work demonstrates the value of meta-knowledge accumulation for flexible, data-efficient multimodal understanding and reasoning, and points to future work on extending the approach to additional modalities.

Abstract

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods try to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already learned capacity. By updating only the learnable parameters of the meta-mapper, it learns to accrue shared meta-knowledge among these tasks. Thus, it can rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings while being computationally more efficient.
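
To make the idea of a visual prefix concrete, the following is a minimal sketch of how a small trainable mapper could turn frozen image features into a short sequence of vectors that is prepended to the text embeddings of a frozen language model. It is an illustration under stated assumptions, not the paper's architecture: the mapper here is a single linear layer, and the names `make_mapper`, `visual_prefix`, and `build_lm_input`, as well as all dimensions, are hypothetical.

```python
import random


def make_mapper(feature_dim, prefix_len, embed_dim, seed=0):
    """Create weights for a hypothetical linear mapper (the only trainable part)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(feature_dim)]
        for _ in range(prefix_len * embed_dim)
    ]


def visual_prefix(mapper, image_features, prefix_len, embed_dim):
    """Map one frozen image feature vector to prefix_len vectors of size embed_dim."""
    flat = [sum(w * f for w, f in zip(row, image_features)) for row in mapper]
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]


def build_lm_input(prefix, text_embeddings):
    """Prepend the visual prefix to the token embeddings of the text prompt."""
    return prefix + text_embeddings


# Toy usage: 8-dim image features, a prefix of 4 vectors, 6-dim language embeddings.
mapper = make_mapper(feature_dim=8, prefix_len=4, embed_dim=6)
prefix = visual_prefix(mapper, [1.0] * 8, prefix_len=4, embed_dim=6)
sequence = build_lm_input(prefix, [[0.0] * 6 for _ in range(3)])
```

In this sketch the vision encoder and the language model would stay frozen, so only the mapper weights would receive gradient updates.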

Paper Structure

This paper contains 27 sections, 5 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Multimodal few-shot meta-learning task for an example of a 2-way 1-shot setting, with two categories (ways) present in the support set images, each represented by one sample (shot). Given a batch of tasks $\mathcal{T}_i$, the support set is first used to obtain task-specific model parameters $\theta_i'$ for each task with a few gradient-step updates. These parameters are then used together with the query set samples to perform a meta-update of the meta-parameters $\theta$. After meta-training is finished, the meta-trained model is used for inference on a new task by further adapting it with the support set and measuring performance on unseen query samples.
  • Figure 2: The architecture of the multimodal meta few-shot learner. It consists of three parts: a frozen vision encoder $v_{\phi}$; a frozen language model with a text embedder $g_{\psi}$ and a generator $g_{\omega}$; and a meta-mapper $f_\theta$ with trainable meta-parameters $\theta$. In the example shown, the model is generating the last word, "retriever", in an autoregressive manner.
  • Figure 3: Relationship between consecutive steps of gradient-based updates and the accuracy on Real-Name miniImageNet on 5-way tasks, performed during the meta-test stage.
  • Figure 4: Qualitative examples of query set images from Real-Name miniImageNet (first two) and Real-Fast VQA (last two), with their questions, ground-truth answers, and the answers generated by our model.
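
The procedure described in the Figure 1 caption (adapt on the support set, then meta-update using the query set) can be sketched on a toy problem. This is a hedged illustration, not the paper's implementation: it uses a scalar parameter, a synthetic regression task in place of multimodal data, and a first-order approximation of the meta-gradient, all of which are assumptions. The function names and hyperparameters are hypothetical.

```python
import random


def mse(theta, samples):
    """Mean squared error of the toy model y_hat = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in samples) / len(samples)


def mse_grad(theta, samples):
    """Gradient of the mean squared error with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in samples) / len(samples)


def adapt(theta, support, inner_lr=0.1, inner_steps=3):
    """Inner loop: a few gradient steps on the support set yield theta_i'."""
    theta_i = theta
    for _ in range(inner_steps):
        theta_i -= inner_lr * mse_grad(theta_i, support)
    return theta_i


def sample_task(rng, shots=2, queries=4):
    """Hypothetical task: fit y = slope * x for a task-specific slope."""
    slope = rng.uniform(1.5, 2.5)

    def draw(n):
        return [(x, slope * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]

    return draw(shots), draw(queries)


def meta_train(theta=0.0, meta_lr=0.05, meta_steps=200, batch_size=4, seed=0):
    """Outer loop: update the meta-parameter from query-set gradients."""
    rng = random.Random(seed)
    for _ in range(meta_steps):
        meta_grad = 0.0
        for _ in range(batch_size):
            support, query = sample_task(rng)
            theta_i = adapt(theta, support)
            # First-order approximation (an assumption of this sketch): use the
            # query gradient at theta_i' and ignore second-order terms.
            meta_grad += mse_grad(theta_i, query)
        theta -= meta_lr * meta_grad / batch_size
    return theta


# Meta-test: adapt the meta-trained parameter on a new task's support set,
# then compare the query loss before and after adaptation.
meta_theta = meta_train()
new_support, new_query = sample_task(random.Random(123))
adapted_theta = adapt(meta_theta, new_support)
loss_before = mse(meta_theta, new_query)
loss_after = mse(adapted_theta, new_query)
```

In the setting the caption describes, the scalar `theta` would correspond to the meta-mapper parameters $\theta$, and the loss would come from the frozen language model's predictions rather than from a squared error.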