Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

Zirun Guo; Tao Jin; Zhou Zhao

TL;DR

This work tackles missing modalities in multimodal sentiment analysis and emotion recognition by introducing a prompt-learning framework that freezes the backbone and trains three specialized prompts. The Missing Modality Generation Module (MMGM) uses generative prompts to synthesize missing features, while missing-signal and missing-type prompts inform the model about missingness and enable cross-modal learning, all with linear scalability in the number of modalities. Pretraining on a high-resource dataset and subsequent prompt-based adaptation yield strong, parameter-efficient performance across CMU-MOSEI, CMU-MOSI, IEMOCAP, and CH-SIMS, with notable gains when modalities are incomplete and robust generalization to different backbones. The approach reduces computational overhead and offers practical deployment benefits for real-world systems facing missing data and resource constraints.
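The linear-scalability claim can be made concrete with a small counting sketch. With M modalities there are 2^M - 1 availability patterns in which at least one modality is present, so a scheme that learns one prompt per pattern grows exponentially with M. A scheme that keeps a small fixed number of prompts per modality grows linearly. The code below is only an illustration under stated assumptions: the prompt bank layout (one "present" and one "missing" vector per modality), the placeholder vector values, and the helper names `missing_cases`, `build_prompt_bank`, and `select_prompts` are hypothetical and are not taken from the paper.

```python
from itertools import product

MODALITIES = ("audio", "visual", "text")


def missing_cases(modalities):
    """Enumerate every availability pattern with at least one modality present."""
    cases = []
    for flags in product((True, False), repeat=len(modalities)):
        if any(flags):
            cases.append(dict(zip(modalities, flags)))
    return cases


def build_prompt_bank(modalities, dim=4):
    """Hypothetical bank: one 'present' and one 'missing' vector per modality.

    The values are placeholders; in a real model they would be trainable
    parameters while the backbone stays frozen.
    """
    bank = {}
    for i, name in enumerate(modalities):
        bank[(name, True)] = [float(i + 1)] * dim
        bank[(name, False)] = [-float(i + 1)] * dim
    return bank


def select_prompts(case, bank):
    """Pick one prompt per modality according to its availability flag."""
    return [bank[(name, present)] for name, present in case.items()]


cases = missing_cases(MODALITIES)
bank = build_prompt_bank(MODALITIES)
print(f"availability patterns: {len(cases)}, prompt vectors: {len(bank)}")
```

For three modalities the sketch covers all 7 availability patterns with a bank of 6 vectors; with five modalities the same layout would cover 31 patterns with 10 vectors.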

Abstract

The development of multimodal models has significantly advanced multimodal sentiment analysis and emotion recognition. However, in real-world applications, the presence of various missing modality cases often leads to a degradation in the model's performance. In this work, we propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities. Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts. These prompts enable the generation of missing modality features and facilitate the learning of intra- and inter-modality information. Through prompt learning, we achieve a substantial reduction in the number of trainable parameters. Our proposed method outperforms other methods significantly across all evaluation metrics. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness and robustness of our method, showcasing its ability to effectively handle missing modalities.

Paper Structure (14 sections, 5 equations, 7 figures, 3 tables)

Introduction
Related Works
Proposed Method
Overall Architecture
Missing Modality Generation Module (MMGM)
Missing-signal and Missing-type Prompts
Experiments
Datasets and Evaluation Metrics
Baselines
Implementation Details
Main Results
Generalization Ability
Ablation Study
Conclusion

Figures (7)

Figure 1: The overall architecture of our proposed method. A batch of data containing different missing-modality cases is fed to the Missing Modality Generation Module (MMGM) to obtain generated features. These features are then passed to the pre-trained backbone together with missing-signal prompts and missing-type prompts (see the sections "Missing Modality Generation Module (MMGM)" and "Missing-signal and Missing-type Prompts").
Figure 2: The illustration of the Missing Modality Generation Module (MMGM). The figure shows how the audio feature is generated for an example $\boldsymbol{x}=(x^{am},x^v,x^t)$ in which the audio modality is missing and the other two are available. The process is described by Equation (1).
Figure 3: The illustration of attaching missing-type prompts to the Transformer. With the missing-type matrix $\mathbf{M_P}$, we generate missing-type prompts $P^\prime_{MT}$ for different missing modality cases. The figure shows the process of attaching missing-type prompts using an example of $\boldsymbol{x}=(x^{am},x^v,x^{tm})$ where audio and text modalities are missing.
Figure 4: Performance comparison with different modality missing rates during testing. (a): ACC on CMU-MOSI. (b): F1 score on CMU-MOSI. (c): MAE on CMU-MOSI. (d): Corr on CMU-MOSI. (e): ACC on IEMOCAP. (f): F1 score on IEMOCAP. (g): ACC on CH-SIMS. (h): F1 score on CH-SIMS.
Figure 5: The effectiveness of the three types of prompts on an example from CH-SIMS. The ground-truth label of the sample is "Negative". We report the results when the visual modality is missing.
...and 2 more figures
