MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis

Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang

TL;DR

MS-Mix is an emotion-aware data augmentation framework for multimodal sentiment analysis that preserves semantic consistency across text, audio, and video when samples are mixed. Its three components, Sentiment-Aware Sample Selection (SASS), Sentiment Intensity Guided (SIG) mixing, and Sentiment Alignment Loss (SAL), provide adaptive, modality-specific mixing and regularization, improving generalization when annotated data is limited. On MOSI, MOSEI, and SIMS with six backbones, MS-Mix achieves state-of-the-art or competitive performance and remains robust to occlusion and hyperparameter variation. The approach offers a practical, end-to-end augmentation strategy for cross-modal representation learning and sentiment prediction in real-world settings.

Abstract

Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency because it lacks emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix are: (1) a Sentiment-Aware Sample Selection (SASS) strategy that prevents the semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module that uses multi-head self-attention to compute modality-specific mixing ratios dynamically from each modality's emotional intensity; and (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities and incorporates a Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
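
The idea of mixing only emotionally compatible samples can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of absolute label difference as the sentiment distance, the threshold `delta`, and the single Beta-sampled ratio `lam` are all assumptions made for illustration. In particular, the attention-based, modality-specific ratios of the SIG module are not modeled here.

```python
import random


def select_partner(labels, i, delta, rng):
    """Return an index j != i whose label lies within delta of labels[i].

    Hypothetical stand-in for sentiment-aware selection: the distance
    is assumed to be the absolute difference of sentiment labels.
    Returns None when no compatible partner exists.
    """
    candidates = [
        j for j, y in enumerate(labels)
        if j != i and abs(y - labels[i]) <= delta
    ]
    return rng.choice(candidates) if candidates else None


def mixup(x_i, x_j, y_i, y_j, lam):
    """Standard mixup: convex combination of two feature vectors and labels."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    mixed_y = lam * y_i + (1.0 - lam) * y_j
    return mixed_x, mixed_y


def augment(features, labels, delta=1.0, alpha=1.0, seed=0):
    """Mix each sample with a sentiment-compatible partner, if one exists."""
    rng = random.Random(seed)
    mixed = []
    for i in range(len(features)):
        j = select_partner(labels, i, delta, rng)
        if j is None:
            continue  # skip samples with no partner inside the threshold
        lam = rng.betavariate(alpha, alpha)  # assumed Beta(alpha, alpha) ratio
        mixed.append(mixup(features[i], features[j], labels[i], labels[j], lam))
    return mixed
```

With labels `[2.0, 2.5, -3.0]` and `delta=1.0`, the first two samples are mixed with each other and the strongly negative third sample is left out, so no mixed label falls between contradictory sentiments.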

Paper Structure

This paper contains 23 sections, 19 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The differences between MS-Mix and traditional mixup methods, represented by Manifold Mixup. Left: (a) The traditional mixup method selects samples at random and uses an offline strategy to optimize mixing ratios. (b) Backbone. (c) MS-Mix. Right: MS-Mix generates augmented samples that align better with the original distribution on the MOSEI dataset. KL denotes the Kullback-Leibler divergence.
  • Figure 2: An overview of the proposed MS-Mix framework. (a) The SASS strategy computes the emotional semantic distance between samples. (b) The SIG module performs adaptive mixing of sample pairs with similar emotional semantics. (c) The SAL ($\mathcal{L}_{SAL}$) serves as an auxiliary regularization term that promotes alignment between predicted emotional intensity values and ground-truth labels via a KL-based loss. These components work together to enhance multimodal representation learning.
  • Figure 3: The t-SNE visualization of the original features and mixed features generated by MS-Mix (a) and $\mathcal{P}$owMix (b) on the MOSEI dataset using the MISA model. We employ a color scheme (blue/red) to differentiate the positive and negative categories, and use transparency to distinguish the original features from the mixed ones.
  • Figure 4: The impact of different parameter values. (a) Accuracy under different similarity thresholds $\delta$. (b) Accuracy under different combinations of $\xi_1$ and $\xi_2$. (c) Accuracy under different values of $\alpha$. (d) Accuracy under different numbers of attention heads $h$.
  • Figure 5: The performance of different mixup-based methods at different occlusion ratios on three datasets.
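
The KL divergence mentioned in the Figure 1 caption measures how far an augmented distribution drifts from the original one. The sketch below shows one way such a comparison could be computed over binned sentiment labels. The choice of label histograms, the bin count, the smoothing constant `eps`, and the `[-3, 3]` range in the usage note are assumptions; the captions do not state which quantities the figure actually compares.

```python
import math


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins over [lo, hi]; clamp out-of-range values."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        k = int((v - lo) / width)
        k = max(0, min(k, bins - 1))  # clamp to the valid bin range
        counts[k] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-8):
    """KL(P || Q) between two histograms, with additive smoothing.

    eps keeps empty bins from producing log(0) or division by zero.
    """
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl
```

For example, `kl_divergence(histogram(original, -3, 3, 6), histogram(mixed, -3, 3, 6))` would be close to zero when the mixed labels follow the original label distribution and would grow as the two diverge.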