Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Yuzhe Weng, Haotian Wang, Tian Gao, Kewei Li, Shutong Niu, Jun Du

TL;DR

This work tackles multimodal sentiment analysis when the text modality is missing, a scenario common due to annotation costs and ASR limitations. It introduces a Double-Flow Self-Distillation Framework combining Unified Modality Cross-Attention (UMCA) with a Modality Imagination Autoencoder (MIA), augmented by an LLM-based text simulation pipeline to produce pseudo-text representations from audio. Training optimizes a unified objective that includes sentiment regression and multiple distillation and alignment losses (MKD, RS, RNC) to harmonize complete and missing modalities, enabling robust performance in both settings. Evaluations on CMU-MOSEI show state-of-the-art MAE and competitive ACC with minimal degradation when text is absent, demonstrating practical robustness and reduced reliance on costly text data.

Abstract

In multimodal sentiment analysis, text data is often harder to collect than video or audio because of higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, we develop a robust model that integrates multimodal sentiment information effectively even when the text modality is absent. Specifically, we propose a Double-Flow Self-Distillation Framework, comprising Unified Modality Cross-Attention (UMCA) and a Modality Imagination Autoencoder (MIA), which handles both the complete-modality scenario and the missing-text scenario. When the text modality is missing, the framework uses an LLM-based model to simulate the text representation from the audio modality, while the MIA module supplements information from the other two modalities so that the simulated text representation resembles the real one. To further align the simulated and real representations, and to let the model capture the continuous ordering of samples in sentiment valence regression, we also introduce the Rank-N Contrast (RNC) loss function. Tested on CMU-MOSEI, our model achieved outstanding MAE and significantly outperformed other models when the text modality is missing. The code is available at: https://github.com/WarmCongee/SDUMC
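
The abstract names the Rank-N Contrast (RNC) loss but does not state its formula. As a rough illustration only, the sketch below follows the generally published Rank-N-Contrast formulation for regression: for an anchor i and another sample j, every sample whose label is at least as far from the anchor's label as j's is treated as a negative. The function name `rnc_loss`, the negative-L2 similarity, and the default temperature are assumptions made for this example; it is not the authors' implementation and ignores how the framework pairs simulated and real text representations.

```python
import math


def rnc_loss(features, labels, temperature=2.0):
    """Rank-N-Contrast-style loss over one batch (pure-Python sketch).

    features: list of equal-length float vectors, one per sample.
    labels:   list of scalar regression targets (e.g. sentiment valence).
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two samples")

    def sim(a, b):
        # Assumption: similarity is the negative Euclidean distance.
        return -math.dist(a, b)

    total, count = 0.0, 0
    for i in range(n):              # anchor
        for j in range(n):          # sample to pull towards the anchor
            if j == i:
                continue
            d_ij = abs(labels[i] - labels[j])
            # Denominator set: samples at least as far from the anchor
            # in label space as j (this always includes j itself).
            logits = [
                sim(features[i], features[k]) / temperature
                for k in range(n)
                if k != i and abs(labels[i] - labels[k]) >= d_ij
            ]
            pos = sim(features[i], features[j]) / temperature
            # Numerically stable log-sum-exp, then -log softmax of pos.
            m = max(logits)
            lse = m + math.log(sum(math.exp(x - m) for x in logits))
            total += lse - pos
            count += 1
    return total / count


# Features whose ordering matches the label ordering...
ordered = rnc_loss([[-2.0], [0.0], [2.0]], [-2.0, 0.0, 2.0], temperature=1.0)
# ...versus the same features assigned to the wrong labels.
shuffled = rnc_loss([[0.0], [-2.0], [2.0]], [-2.0, 0.0, 2.0], temperature=1.0)
```

In this toy batch the loss is lower when distances in feature space follow distances in label space, which is the property the abstract describes as capturing the continuous ordering of samples.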

Paper Structure

This paper contains 18 sections, 8 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overall architecture of the Double-Flow Self-Distillation Framework for missing modalities. The middle part of the upper row shows the training process of the framework; the lower row shows the network structure of each module.
  • Figure 2: Feature similarity visualization sorted by labels.