M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Taowen Wang, Yiyang Liu, James Chenhao Liang, Junhan Zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu

TL;DR

A novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs, which effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
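
The core idea in the abstract, learnable prompts attached to both the vision encoder and the language model while the backbone stays frozen, can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical and is not the authors' implementation: `frozen_layer` is a toy stand-in for a frozen transformer layer, the prompt initialization is arbitrary, the vision-to-language projector is omitted, and discarding each layer's prompt outputs before the next layer is an assumption borrowed from common deep prompt tuning setups.

```python
import random


def make_prompts(num_layers, prompt_len, dim, seed=0):
    """Create one set of learnable prompt vectors per layer (hypothetical init)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(prompt_len)]
        for _ in range(num_layers)
    ]


def frozen_layer(tokens):
    """Toy stand-in for a frozen layer: mixes each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def forward_with_prompts(tokens, prompts):
    """Prepend each layer's prompts to its input, then drop the prompt outputs.

    Assumption: fresh prompts are inserted at every layer and the outputs at
    the prompt positions are discarded before the next layer.
    """
    for layer_prompts in prompts:
        out = frozen_layer(layer_prompts + tokens)
        tokens = out[len(layer_prompts):]
    return tokens


def count_params(prompts):
    """Number of trainable scalars; only the prompts would be tuned."""
    return sum(len(vec) for layer in prompts for vec in layer)


# Hypothetical sizes chosen only for illustration.
visual_prompts = make_prompts(num_layers=2, prompt_len=3, dim=4, seed=0)
textual_prompts = make_prompts(num_layers=2, prompt_len=5, dim=4, seed=1)

patches = [[1.0] * 4 for _ in range(6)]       # image patch embeddings
instruction = [[0.5] * 4 for _ in range(2)]   # instruction token embeddings

# Visual prompts act inside the (toy) vision encoder.
visual_features = forward_with_prompts(patches, visual_prompts)
# Textual prompts act inside the (toy) LLM, which sees both modalities.
llm_out = forward_with_prompts(visual_features + instruction, textual_prompts)
```

In this sketch the only trainable quantities are the prompt vectors, so the tuned parameter count scales with the number of layers, the prompt length, and the hidden size rather than with the size of the backbone.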

Paper Structure (35 sections, 5 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Comparison of M$^2$PT and several PEFT methods, including LoRA \cite{hu2021lora}, PTUM \cite{PTUMPM}, and VPT \cite{han2024facing}, on multimodal tasks. Our approach exhibits superior performance across a range of benchmarks.
  • Figure 2: Overview of our M$^2$PT approach. Here, visual prompts are embedded into each layer of the Visual Encoder, and textual prompts are embedded into each layer of the LLM. These prompts facilitate the extraction and alignment of features across modalities (e.g., vision, language). The cross-modality interaction between visual and textual features is enhanced through layered integration, ultimately improving the model's capability in zero-shot instruction learning tasks (see §\ref{sec:exp}).
  • Figure 3: Comprehensive visualization of attention activation maps. This figure examines the activation patterns within the last layer of the LLM and the Visual Encoder, respectively. As seen, the visual prompts and textual prompts have noticeably high activation levels during inference (i.e., $\bullet$ and $\bullet$ represent the textual prompts' activation signal and the visual prompts' activation signal, respectively).
  • Figure 4: Impact of Different Components.
  • Figure 5: Performance of Different Prompt Lengths. Each cell in the map corresponds to the score of a model with a given textual prompt length (row) and visual prompt length (column). A darker hue indicates a higher score, whereas a lighter hue signifies a lower score.
  • ...and 5 more figures