MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Oscar Mañas, Pau Rodriguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, Aishwarya Agrawal

TL;DR

MAPL tackles multimodal vision-language tasks by reusing frozen unimodal foundations and learning a compact cross-modal mapping. It trains only a 3.4M-parameter mapping network to align CLIP-like visual features with GPT-J token embeddings, enabling zero- and few-shot transfer with small data and modest compute. Across VQA and image captioning benchmarks, MAPL achieves superior or competitive results while using orders of magnitude fewer trainable parameters than baselines like Frozen or Flamingo. The work demonstrates strong potential for low-data and in-domain VL adaptation, highlighting data quality and model portability as key factors in practical, resource-efficient multimodal systems.

Abstract

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.

Paper Structure

This paper contains 40 sections, 21 figures, 6 tables.

Figures (21)

  • Figure 1: MAPL leverages a pre-trained vision encoder and a pre-trained LM, and learns a small mapping network to convert visual features into token embeddings. During training, only the mapping network is updated, keeping the vision encoder and the LM frozen (red arrows indicate gradient flow). At inference time, the system can take as input an arbitrary sequence of interleaved images and text, and generates free-form text as output.
  • Figure 2: The mapping network takes a flattened grid of $L_i$ visual features of dimension $D_i$ each from the vision encoder and transforms it into a sequence of token embeddings of length $L_o$ and dimension $D_o$, where $D_o$ is the token embedding dimension of the LM. Note that the parameters are shared across fully-connected (FC) layers, on both sides of the encoder transformer.
  • Figure 3: Qualitative samples from the web using MAPL$_\text{CC-clean}$. (Multimodal) input is in gray, and MAPL's output is in green (success) or red (failure).
  • Figure 4: MAPL's image captioning on Conceptual Captions.
  • Figure 5: MAPL's image captioning on COCO Captions.
  • ...and 16 more figures
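
The shape transformation described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the released MAPL implementation: the hidden size `d_h`, the learned output slots (`queries`), the single self-attention layer standing in for the encoder transformer, and all toy dimensions are assumptions made for illustration. The sketch also reads "shared" FC parameters as one weight matrix applied at every sequence position, which is an interpretation of the caption rather than something it states.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MappingSketch:
    """Shape-level sketch: (L_i, D_i) visual features -> (L_o, D_o) token embeddings.

    Hypothetical stand-in for a mapping network; d_h, the learned queries, and
    the single attention layer are illustrative assumptions, not MAPL's design.
    """

    def __init__(self, d_i, d_o, l_o, d_h=64, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        self.l_o = l_o
        # Input FC: the same weights are applied to every visual position.
        self.w_in = rng.normal(0.0, scale, (d_i, d_h))
        self.b_in = np.zeros(d_h)
        # Assumption: L_o learned slots that will become the output sequence.
        self.queries = rng.normal(0.0, scale, (l_o, d_h))
        # One self-attention layer as a stand-in for the encoder transformer.
        self.w_q = rng.normal(0.0, scale, (d_h, d_h))
        self.w_k = rng.normal(0.0, scale, (d_h, d_h))
        self.w_v = rng.normal(0.0, scale, (d_h, d_h))
        # Output FC: the same weights are applied to every output position.
        self.w_out = rng.normal(0.0, scale, (d_h, d_o))
        self.b_out = np.zeros(d_o)

    def __call__(self, feats):
        h = feats @ self.w_in + self.b_in               # (L_i, d_h)
        x = np.concatenate([h, self.queries], axis=0)   # (L_i + L_o, d_h)
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (L_i + L_o, L_i + L_o)
        x = x + attn @ v                                # residual connection
        out = x[-self.l_o:]                             # keep the L_o slots
        return out @ self.w_out + self.b_out            # (L_o, D_o)


# Toy sizes chosen for illustration only (not the paper's values).
L_i, D_i, L_o, D_o = 49, 512, 8, 128
feats = np.random.default_rng(1).normal(size=(L_i, D_i))
mapper = MappingSketch(d_i=D_i, d_o=D_o, l_o=L_o)
emb = mapper(feats)
print(emb.shape)
```

In a setup like the one in Figure 1, only the arrays held by such a mapper would receive gradient updates, while the vision encoder producing `feats` and the LM consuming `emb` stay frozen.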