Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection

Dingkang Yang, Mingcheng Li, Xuecheng Wu, Zhaoyu Chen, Kaixun Jiang, Keliang Liu, Peng Zhai, Lihua Zhang

TL;DR

The paper tackles multimodal sentiment analysis (MSA) by addressing two problems: imbalanced unimodal contributions, and redundancy and noise in the non-language modalities. It introduces MODS, a framework with three components: a Graph-based Dynamic Sequence Compressor (GDC) that reduces sequential redundancy in the acoustic and visual streams, a sample-adaptive Primary Modality Selector (MSelector) that decides per sample which modality dominates, and a Primary-modality-Centric Cross-Attention (PCCA) module that reinforces the dominant modality while integrating cross-modal information. The authors report state-of-the-art results on MOSI, MOSEI, SIMS, and SIMSv2, with ablations supporting the contribution of each component. Overall, MODS performs sample-aware multimodal fusion that is more robust to modality variability and noise in video sentiment analysis.

Abstract

Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.
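The idea of sample-adaptive primary modality selection can be illustrated with a small sketch. The abstract does not specify how MSelector computes its decision, so everything here is an assumption made for illustration: the function names `softmax` and `select_primary`, the per-modality scalar scores, and the softmax-then-argmax rule are hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_primary(scores_by_modality):
    """Pick a primary modality for ONE sample.

    scores_by_modality: dict mapping a modality name to a hypothetical
    scalar relevance score (assumed to come from some learned scorer;
    the paper's actual scoring mechanism is not described here).
    Returns (primary_name, weights), where weights sum to 1.
    """
    names = list(scores_by_modality)
    weights = softmax([scores_by_modality[n] for n in names])
    weight_map = dict(zip(names, weights))
    primary = max(weight_map, key=weight_map.get)
    return primary, weight_map


# Two samples with different dominant modalities (made-up scores).
sample_a = {"language": 2.0, "acoustic": 0.5, "visual": 1.0}
sample_b = {"language": 0.2, "acoustic": 0.4, "visual": 1.8}

primary_a, weights_a = select_primary(sample_a)
primary_b, weights_b = select_primary(sample_b)
```

The point of the sketch is only that the primary modality is chosen per sample rather than fixed in advance: with these made-up scores, the first sample is led by language and the second by the visual stream.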

Paper Structure

This paper contains 17 sections, 25 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overall architecture of the proposed MODS framework.
  • Figure 2: The architecture of the proposed GDC module.
  • Figure 3: The architecture of the proposed PCCA module.
  • Figure 4: Display of cases and modality weights on SIMS.