Semi-IIN: Semi-supervised Intra-inter modal Interaction Learning Network for Multimodal Sentiment Analysis

Jinhao Lin, Yifei Wang, Yanwu Xu, Qi Liu

TL;DR

This paper addresses the high annotation cost and label ambiguity in multimodal sentiment analysis by introducing Semi-IIN, a semi-supervised framework that dynamically balances intra- and inter-modal interactions via masked attention and gate-based fusion. It combines two dedicated attention streams, IntraMA and InterMA, with a self-training scheme that generates reliable pseudo-labels from unlabeled data, and it is optimized with a combination of three losses: $L_v$ (MSE), $L_e$ (cross-entropy), and $L_e^u$. The approach reports state-of-the-art performance on MOSI and MOSEI, supported by extensive ablations and qualitative analyses, and provides insights into the effectiveness of separate intra- and inter-modal pathways for robust sentiment understanding. The work advances practical multimodal sentiment analysis by reducing labeling requirements and improving resilience to unlabeled data while enabling interpretable intra- and inter-modal interactions.
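As a rough illustration of how a self-training objective of this shape can be assembled, the sketch below filters pseudo-labels by confidence and sums three loss terms. It is not the authors' implementation: the function names, the confidence threshold of 0.9, the unit loss weights, and the reading of $L_e^u$ as a cross-entropy on pseudo-labeled unlabeled samples are all assumptions made for the example.

```python
import math

def mse(preds, targets):
    """Mean squared error, standing in for L_v."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the given labels, standing in for L_e."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def select_pseudo_labels(teacher_probs, threshold=0.9):
    """Keep unlabeled samples whose top class probability reaches the threshold.

    The threshold rule is an assumption; the paper may select pseudo-labels differently.
    Returns a list of (sample_index, pseudo_label) pairs.
    """
    kept = []
    for idx, p in enumerate(teacher_probs):
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= threshold:
            kept.append((idx, label))
    return kept

def total_loss(v_pred, v_true, e_probs, e_true, u_probs, pseudo, weight_u=1.0):
    """Sum of L_v, L_e and a weighted L_e^u (the weighting is hypothetical)."""
    l_v = mse(v_pred, v_true)
    l_e = cross_entropy(e_probs, e_true)
    if pseudo:
        l_eu = cross_entropy([u_probs[i] for i, _ in pseudo],
                             [y for _, y in pseudo])
    else:
        l_eu = 0.0  # no confident pseudo-labels in this batch
    return l_v + l_e + weight_u * l_eu

# Teacher predictions on three unlabeled samples; only two are confident enough.
teacher = [[0.95, 0.05], [0.6, 0.4], [0.02, 0.98]]
pseudo = select_pseudo_labels(teacher)
```

In a self-training loop of this kind, the pseudo-labels would come from a model trained on the labeled data, and the student's predictions on the same unlabeled samples would be passed as `u_probs`.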

Abstract

Multimodal sentiment analysis is a fertile research area that merits further investigation, but current approaches incur high annotation costs and suffer from label ambiguity, both of which make high-quality labeled data hard to acquire. Furthermore, choosing the right interactions is essential because the relative importance of intra- and inter-modal interactions can differ from sample to sample. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal Interaction learning Network for multimodal sentiment analysis. Semi-IIN integrates masked attention and gating mechanisms, enabling effective dynamic selection after independently capturing intra- and inter-modal interactive information. Combined with a self-training approach, Semi-IIN fully utilizes the knowledge learned from unlabeled data. Experimental results on two public datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN, establishing a new state of the art on several metrics. Code is available at https://github.com/flow-ljh/Semi-IIN.
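The idea of capturing intra- and inter-modal interactions independently and then selecting between them can be sketched with two complementary attention masks and a sigmoid gate. This is a minimal, dependency-free illustration rather than the architecture from the paper: it assumes the modalities are concatenated along the sequence axis, uses a single unparameterized attention head, and the names `build_masks`, `masked_attention`, `gated_fusion`, and `w_gate` are hypothetical.

```python
import math

def build_masks(lengths):
    """Return (intra_mask, inter_mask) for modalities concatenated in order.

    mask[i][j] is True when position i may attend to position j.
    """
    # Modality index of every position in the concatenated sequence.
    owner = [m for m, n in enumerate(lengths) for _ in range(n)]
    total = len(owner)
    intra = [[owner[i] == owner[j] for j in range(total)] for i in range(total)]
    inter = [[owner[i] != owner[j] for j in range(total)] for i in range(total)]
    return intra, inter

def masked_attention(x, mask):
    """Scaled dot-product self-attention over x, restricted by mask.

    Assumes every row of the mask allows at least one position.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k in enumerate(x)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked positions get weight 0
        norm = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) / norm for c in range(d)])
    return out

def gated_fusion(h_intra, h_inter, w_gate):
    """Mix the two streams per position with a sigmoid gate (assumed form)."""
    fused = []
    for a, b in zip(h_intra, h_inter):
        logit = sum(w * v for w, v in zip(w_gate, a + b))
        g = 1.0 / (1.0 + math.exp(-logit))
        fused.append([g * p + (1.0 - g) * q for p, q in zip(a, b)])
    return fused

# Toy input: 2 text positions, 1 audio position, 1 visual position, feature size 2.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
intra_mask, inter_mask = build_masks([2, 1, 1])
h_intra = masked_attention(x, intra_mask)
h_inter = masked_attention(x, inter_mask)
fused = gated_fusion(h_intra, h_inter, w_gate=[0.0, 0.0, 0.0, 0.0])
```

With an all-zero `w_gate` the gate is 0.5 everywhere, so the fused output is the plain average of the two streams; a trained gate would instead shift weight toward whichever stream is more useful for a given sample.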

Paper Structure

This paper contains 24 sections, 23 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: The importance of dynamically controlling the intra- and inter-modal interactive information. The arrows denote attention weights: blue arrows indicate the attention weight distribution "between words in language modeling", and orange arrows indicate the distribution "from visual to textual modalities". For instance, on the semantic similarity level, the words "scene" and "that" refer to the same concept, resulting in higher attention scores between them (arrow: "scene" to "that"). On the task-oriented level, the word "kind" is a key sentiment word and thus has a higher self-attention score (arrow: "kind" to "kind").
  • Figure 2: The overall architecture of Semi-IIN. Notably, $Z^{0}_{inter}$ and $Z^{0}_{intra}$ are the same as $Z$ in the equation that defines $Z$.
  • Figure 3: Implementation of InterMA (top) and IntraMA (bottom).
  • Figure 4: Results under different proportions of labeled samples on MOSI dataset.
  • Figure 6: Case study for Semi-IIN. "Only Intra" and "Only Inter" refer to the predictions of the stacked IntraMAU and InterMAU, respectively.
  • ...and 1 more figure