TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities

Yan Zhuang, Minhao Liu, Yanru Zhang, Jiawen Deng, Fuji Ren

TL;DR

TMDC addresses missing and noisy modalities in Multimodal Sentiment Analysis (MSA) with a two-stage training framework. First, the Intra-Modality Denoising (IMD) stage learns denoised modality-specific and modality-shared representations from complete data, using Variational Information Bottleneck (VIB) and attention-based modules. Then, the Inter-Modality Complementation (IMC) stage trains on incomplete data and uses these representations to compensate for missing modalities; a fully connected layer produces the final prediction. The approach combines modality-specific and shared information through dedicated denoising modules and cross-modal attention, and reports state-of-the-art performance on MOSI, MOSEI, and IEMOCAP under fixed missing-modality scenarios and high-noise conditions. Key findings are the importance of the IMC stage, the contribution of modality-specific denoising, and the robustness of the learned representations as noise increases. The work makes MSA more practical in real-world settings where data are frequently incomplete and noisy, and points to future work on reducing redundancy in shared representations to improve efficiency.
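
To make the VIB idea mentioned above concrete, here is a minimal sketch of the standard VIB objective: a task loss plus a $\beta$-weighted KL term that pulls a stochastic representation toward a prior. It assumes a diagonal Gaussian encoder and a standard normal prior, which is the common textbook form and not necessarily the exact loss used in TMDC. The function names (`reparameterize`, `kl_to_standard_normal`, `vib_loss`) are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vib_loss(task_loss, mu, log_var, beta):
    """Assumed form: task loss plus beta times the KL compression term."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [1.0, 0.0], [0.0, 0.0]
    z = reparameterize(mu, log_var, rng)  # stochastic representation
    print(len(z), vib_loss(1.0, mu, log_var, beta=0.1))
```

A larger `beta` compresses the representation more aggressively (discarding more input detail, including noise), while a smaller `beta` favors the task loss.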

Abstract

Multimodal Sentiment Analysis (MSA) aims to infer human sentiment by integrating information from multiple modalities such as text, audio, and video. In real-world scenarios, however, the presence of missing modalities and noisy signals significantly hinders the robustness and accuracy of existing models. While prior works have made progress on these issues, they typically address them in isolation, limiting overall effectiveness in practical settings. To jointly mitigate the challenges posed by missing and noisy modalities, we propose a framework called Two-stage Modality Denoising and Complementation (TMDC). TMDC comprises two sequential training stages. In the Intra-Modality Denoising Stage, denoised modality-specific and modality-shared representations are extracted from complete data using dedicated denoising modules, reducing the impact of noise and enhancing representational robustness. In the Inter-Modality Complementation Stage, these representations are leveraged to compensate for missing modalities, thereby enriching the available information and further improving robustness. Extensive evaluations on MOSI, MOSEI, and IEMOCAP demonstrate that TMDC consistently achieves superior performance compared to existing methods, establishing new state-of-the-art results.

Paper Structure

This paper contains 32 sections, 18 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The existing model yields correct predictions when the input contains only missing modalities (highlighted in green), but fails when both missing (green) and noisy modalities (red) are present.
  • Figure 2: Illustration of the proposed TMDC. TMDC includes two training stages. In the first stage, TMDC learns from complete modality information using two denoising modules. The modality-specific denoising module applies separate networks to each modality to remove noise while preserving unique modality information. Simultaneously, the modality-common denoising module employs a shared network to filter noise across multiple modalities and extract common information. In the second stage, the learned shared information is used to supplement missing modalities.
  • Figure 3: Visualization of the training losses in IMD stage on MOSI dataset.
  • Figure 4: Visualization of cosine similarity of representations on IEMOCAP dataset.
  • Figure 5: Analysis of hyper-parameter $\beta$ in VIB loss.
  • ...and 1 more figure
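
The second stage described in the Figure 2 caption, where shared information supplements missing modalities, can be illustrated with a deliberately simple stand-in. The sketch below assumes each modality has a shared-space vector and fills a missing one with the mean of the available ones. The source describes cross-modal attention for this step, so plain averaging here is an assumption made only for illustration, and `complement_missing` is a hypothetical name.

```python
def complement_missing(shared, modalities=("text", "audio", "video")):
    """Fill each missing modality (value None) with the mean of the
    available shared-space vectors. Averaging is an illustrative
    stand-in, not the paper's attention-based complementation."""
    available = [shared[m] for m in modalities if shared.get(m) is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(available[0])
    mean = [sum(v[i] for v in available) / len(available) for i in range(dim)]
    return {m: (shared[m] if shared.get(m) is not None else list(mean))
            for m in modalities}


if __name__ == "__main__":
    # Audio is missing; text and video shared vectors are available.
    shared = {"text": [1.0, 3.0], "audio": None, "video": [3.0, 5.0]}
    print(complement_missing(shared))
```

The point of the sketch is only the data flow: representations learned from complete data in the first stage give every modality a comparable shared-space vector, which is what makes filling a gap from the remaining modalities possible.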