Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning
Ivona Najdenkoska, Xiantong Zhen, Marcel Worring
TL;DR
The paper tackles multimodal few-shot learning by bridging vision and language through a meta-learning framework that employs a lightweight meta-mapper to generate a visual prefix for a frozen language model. By training across sequences of multimodal tasks, the model accrues shared meta-knowledge that enables rapid adaptation with few gradient steps, without hand-engineered task inductions. Experiments on cross-domain and in-domain benchmarks show the approach outperforms Frozen baselines while using far fewer trainable parameters and computational resources. The work demonstrates the value of meta-knowledge accumulation for flexible, data-efficient multimodal understanding and reasoning, and points to future work in expanding to additional modalities.
Abstract
Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already learned capacity. By updating the learnable parameters only of the meta-mapper, it learns to accrue shared meta-knowledge among these tasks. Thus, it can rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for a hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings while being computationally more efficient.
