Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Zichen Wu; Hsiu-Yuan Huang; Fanyi Qu; Yunfang Wu

TL;DR

This paper proposes Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework built on a unified vision-language model (VLM). It significantly outperforms both widely used prompt methods for VLMs and task-specific methods.

Abstract

Deep multi-modal semantic understanding that goes beyond mere superficial content relation mining has received increasing attention in artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks in this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three soft prompt experts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representations, and a unified prompt that assists multi-modal interaction. Additionally, we reorganize the Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B-parameter InstructBLIP while using merely 2% of its parameters (150M), but also significantly outperforms other widely used prompt methods for VLMs as well as task-specific methods.
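The block-aware fusion idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only states that Transformer layers are grouped into blocks and that cross-modal prompt attention is applied between adjacent blocks. Everything else here is an assumption made for illustration, including the 12-layer depth, the choice of 3 blocks, the function names `split_into_blocks` and `cross_prompt_attention`, the use of plain scaled dot-product attention without learned projections, and the residual update.

```python
import math


def split_into_blocks(num_layers, num_blocks):
    """Partition layer indices 0..num_layers-1 into contiguous, near-equal blocks."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_prompt_attention(queries, keys_values):
    """Let each prompt vector of one modality attend over the other modality's prompts.

    Assumption: scaled dot-product attention with no learned projections,
    followed by a residual connection. The paper may differ.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended = [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


# Hypothetical toy prompts: two vectors of dimension 2 per modality.
text_prompt = [[0.1, 0.2], [0.3, 0.4]]
image_prompt = [[0.5, 0.1], [0.2, 0.9]]

blocks = split_into_blocks(num_layers=12, num_blocks=3)
for prev_block, next_block in zip(blocks, blocks[1:]):
    # The Transformer layers of prev_block would run here (omitted).
    # At the block boundary, each modality's prompt reads from the other's.
    new_text = cross_prompt_attention(text_prompt, image_prompt)
    new_image = cross_prompt_attention(image_prompt, text_prompt)
    text_prompt, image_prompt = new_text, new_image
```

With three blocks there are two boundaries, so the prompts exchange information twice before the final block. Exchanging at boundaries rather than at every layer is one way to read the abstract's "smooth transition" from single-modal representation to fusion.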

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, 6 tables.

Introduction
Related Work
Multi-modal Sarcasm Detection
Multi-modal Sentiment Analysis
Multi-modal Prompt Learning
Preliminary
The Vision-language Pre-trained Model: VLMo
Tuning on Multi-modal Tasks
Finetuning.
Manual Prompt.
Soft Prompt.
The Proposed Model
Mixture-of-Prompt-Experts
Stage 1.
Stage 2.
...and 17 more sections

Figures (4)

Figure 1: The framework of VLMo for image-text detection task. We demonstrate two methods. In finetuning, the [CLS] representation is fed to a classification head, while in manual prompt, the representation of [MASK] is fed to a verbalizer.
Figure 2: Our proposed MoPE-BAF model for multi-modal semantic understanding.
Figure 3: Receptive fields of the different prompts, image patches, and text tokens in the self-attention module when using MoPE. VP and LP are shorthand for V-Prompt and L-Prompt, respectively.
Figure 4: (a) F1 scores of MoPE trained with different prompt lengths. (b) F1 scores of MoPE-BAF trained with different numbers of blocks. (c) Comparison between VLMo and VLMo + MoPE-BAF under different numbers of training shots.
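The manual-prompt path described in the Figure 1 caption, where the [MASK] representation is fed to a verbalizer, can be sketched as follows. This is a minimal illustration rather than the paper's code: the label words ("yes", "no"), the class names, the toy scores, and the rule of scoring a class by its best label word are all assumptions.

```python
def verbalize(mask_scores, verbalizer):
    """Map vocabulary scores at the [MASK] position to a class label.

    mask_scores: dict from vocabulary token to its score at [MASK].
    verbalizer:  dict from class label to its list of label words.
    Assumption: a class is scored by its highest-scoring label word.
    """
    class_scores = {
        label: max(mask_scores[word] for word in words)
        for label, words in verbalizer.items()
    }
    predicted = max(class_scores, key=class_scores.get)
    return predicted, class_scores


# Hypothetical scores a model might assign at the [MASK] position.
mask_scores = {"yes": 2.0, "no": 0.5, "maybe": 1.0}

# Hypothetical verbalizer for sarcasm detection.
verbalizer = {"sarcastic": ["yes"], "non-sarcastic": ["no"]}

predicted, class_scores = verbalize(mask_scores, verbalizer)
print(predicted)
```

In the finetuning path, by contrast, a classification head on the [CLS] representation would produce the class scores directly, with no label words involved.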
