M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Taowen Wang; Yiyang Liu; James Chenhao Liang; Junhan Zhao; Yiming Cui; Yuning Mao; Shaoliang Nie; Jiahao Liu; Fuli Feng; Zenglin Xu; Cheng Han; Lifu Huang; Qifan Wang; Dongfang Liu

TL;DR

M$^2$PT is a Multimodal Prompt Tuning approach for efficient instruction tuning of MLLMs. During finetuning, it integrates visual prompts into the vision encoder and textual prompts into the language processor, facilitating the extraction and alignment of features across modalities.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
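
The prompt placement described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `frozen_layer` is a hypothetical stand-in for a frozen transformer layer, all dimensions and lengths are made up for illustration, and the choice to drop the prompt slots after every layer is an assumption (the abstract does not specify it). A real MLLM would also typically project visual features before passing them to the language model; here both sides share one hidden size for simplicity.

```python
import random


def frozen_layer(tokens):
    """Hypothetical stand-in for a frozen layer: mix each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def init_prompts(num_layers, prompt_len, dim, rng):
    """Create one set of prompt vectors per layer; these are the only tuned parameters."""
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(prompt_len)]
        for _ in range(num_layers)
    ]


def forward_with_prompts(tokens, prompts):
    """At every layer: prepend that layer's prompts, run the frozen layer, drop prompt slots."""
    for layer_prompts in prompts:
        out = frozen_layer(layer_prompts + tokens)
        tokens = out[len(layer_prompts):]  # assumption: prompts are not carried forward
    return tokens


def count_params(prompts):
    """Number of trainable scalars held in the prompts."""
    return sum(len(vec) for layer in prompts for vec in layer)


rng = random.Random(0)
dim = 8  # illustrative hidden size shared by both modalities

# Illustrative inputs: 4 image patch embeddings and 3 text token embeddings.
image_patches = [[rng.random() for _ in range(dim)] for _ in range(4)]
text_tokens = [[rng.random() for _ in range(dim)] for _ in range(3)]

# Visual prompts for a 2-layer "vision encoder", textual prompts for a 3-layer "LLM".
visual_prompts = init_prompts(num_layers=2, prompt_len=5, dim=dim, rng=rng)
textual_prompts = init_prompts(num_layers=3, prompt_len=2, dim=dim, rng=rng)

visual_features = forward_with_prompts(image_patches, visual_prompts)
fused = forward_with_prompts(visual_features + text_tokens, textual_prompts)

print(len(fused), count_params(visual_prompts) + count_params(textual_prompts))
```

In this sketch only the prompt vectors would receive gradient updates, so the trainable parameter count scales with the number of layers, the prompt length, and the hidden size rather than with the size of the frozen backbone.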

Paper Structure (35 sections, 5 equations, 10 figures, 5 tables)

Introduction
Related Work
Multimodal Large Language Models.
Instruction Tuning.
Parameter-Efficient Finetuning.
Methodology
Preliminaries
Multimodal Large Language Models
Multimodal Prompt Tuning
Implementation Details
Experiments
Experiment Setup
Datasets.
Main Result
Analysis and Discussion
...and 20 more sections

Figures (10)

Figure 1: Comparison of M$^2$PT and several PEFT methods, including LoRA [hu2021lora], PTUM [PTUMPM], and VPT [han2024facing], on multimodal tasks. Our approach exhibits superior performance across a range of benchmarks.
Figure 2: Overview of our M$^2$PT approach. Here, visual prompts are embedded into each layer of the Visual Encoder, and textual prompts are embedded into each layer of the LLM. These prompts facilitate the extraction and alignment of features across modalities (e.g., vision, language). The cross-modality interaction between visual and textual features is enhanced through layered integration, ultimately improving the model's capability in zero-shot instruction learning tasks (see the Experiments section).
Figure 3: Comprehensive visualization of attention activation maps. This figure examines the activation patterns within the last layer of the LLM and of the Visual Encoder, respectively. The visual prompts and textual prompts show noticeably high activation levels during inference (the two $\bullet$ markers, distinguished by color in the original figure, denote the textual prompts' activation signal and the visual prompts' activation signal, respectively).
Figure 4: Impact of Different Components.
Figure 5: Performance of Different Prompt Lengths. Each cell in the map corresponds to the score of a model with a given textual prompt length (row) and visual prompt length (column). A darker hue indicates a higher score, whereas a lighter hue signifies a lower score.
...and 5 more figures
