Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

Weide Liu, Huijing Zhan, Hao Chen, Fengmao Lv

TL;DR

The paper tackles multimodal sentiment analysis under missing-modality scenarios and proposes a knowledge-transfer network that reconstructs the missing audio from visual and textual cues, coupled with a cross-modal attention mechanism that fuses reconstructed and observed features. The approach uses transformer-based encoders for each modality and a consistency loss to align reconstructions with ground-truth audio, while the cross-modal attention builds a robust joint representation for sentiment prediction. Extensive experiments on CMU-MOSI, CMU-MOSEI, and IEMOCAP show that the method outperforms missing-modality baselines and can approach, or even match, the performance of methods trained with complete multimodal supervision, with ablations highlighting the effectiveness of language-targeted fusion and the superiority of the L2 consistency loss in this setting. Overall, the work advances practical multimodal sentiment analysis by enabling reliable performance when modalities are unavailable during testing or training.

Abstract

Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most existing research assumes that all modalities are available during both training and testing, which leaves these algorithms vulnerable when a modality is missing. In this paper, we propose a novel knowledge-transfer network that translates between different modalities to reconstruct the missing audio modality. Moreover, we develop a cross-modality attention mechanism to retain the maximal information of the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and show that our method achieves results comparable to previous methods trained with complete multi-modality supervision.
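
The two ingredients named above, a consistency loss on the reconstructed audio and cross-modal attention for fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature sizes, and toy data are hypothetical, the loss is written as a mean squared error (one common form of an L2 loss), and using language features as the attention query is only an assumption based on the "language-targeted fusion" mentioned in the TL;DR. The transformer encoders of the actual model are omitted.

```python
import math
import random


def l2_consistency_loss(reconstructed, target):
    """Mean squared error between reconstructed and ground-truth audio features."""
    total, count = 0.0, 0
    for rec_step, tgt_step in zip(reconstructed, target):
        for r, t in zip(rec_step, tgt_step):
            total += (r - t) ** 2
            count += 1
    return total / count


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query step attends over the
    time steps of another modality and returns a weighted sum of values."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy data (hypothetical sizes): 4 text steps and 6 audio steps, 8 features each.
rng = random.Random(0)


def random_sequence(steps, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(steps)]


text = random_sequence(4, 8)
audio_true = random_sequence(6, 8)
# Stand-in for the output of a reconstruction network: ground truth plus a fixed offset.
audio_rec = [[a + 0.1 for a in step] for step in audio_true]

loss = l2_consistency_loss(audio_rec, audio_true)          # training-time signal
fused = cross_modal_attention(text, audio_rec, audio_rec)  # language attends to A'
```

In this sketch the loss is only computable when ground-truth audio exists, which corresponds to training; at test time only the attention step over the reconstructed features would be used.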

Paper Structure

This paper contains 9 sections, 6 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: The pipeline of our method. A' denotes the reconstructed audio information.