Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis

Chunlei Meng, Jiabin Luo, Zhenglin Yan, Zhenyu Yu, Rong Fu, Zhongxue Gan, Chun Ouyang

TL;DR

The paper tackles multimodal sentiment analysis by introducing Tri-Subspace Disentanglement (TSD), which factorizes multimodal features into a common subspace, submodally-shared subspaces for pairwise cues, and private subspaces for modality-specific information. A decoupling supervisor enforces separation among these subspaces, while the Subspace-Aware Cross-Attention (SACA) fusion module adaptively integrates information from all of them to produce robust representations. Experiments on CMU-MOSI and CMU-MOSEI show state-of-the-art performance, and results on MIntRec indicate that the approach transfers to multimodal intent recognition; ablations confirm the contributions of both tri-subspace disentanglement and SACA. The approach advances multimodal representation learning by explicitly modeling signals shared by only some modality pairs and by providing an interpretable fusion mechanism with dynamic subspace weighting.
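To make the factorization concrete, here is a minimal sketch of the subspace layout for three modalities, followed by a toy weighted fusion. It is an illustration under stated assumptions, not the authors' implementation: the function names (`subspace_layout`, `weighted_fusion`), the one-subspace-per-modality-pair layout, and the plain softmax weighting are all hypothetical. The softmax weighting in particular is a simplified stand-in for SACA, which the paper describes as cross-attention.

```python
import math
from itertools import combinations

MODALITIES = ("language", "visual", "acoustic")


def subspace_layout(modalities):
    """Enumerate the subspaces a tri-subspace factorization would need.

    Assumption: one common subspace over all modalities, one sub-shared
    subspace per modality pair, and one private subspace per modality.
    """
    return {
        "common": [tuple(modalities)],
        "sub_shared": list(combinations(modalities, 2)),
        "private": [(m,) for m in modalities],
    }


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(features, scores):
    """Fuse per-subspace vectors with softmax weights (hypothetical stand-in for SACA).

    features: dict mapping subspace name -> feature vector (equal lengths)
    scores:   dict mapping subspace name -> scalar relevance score
    """
    names = list(features)
    weights = softmax([scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for weight, name in zip(weights, names):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


layout = subspace_layout(MODALITIES)

# Toy two-dimensional features; equal scores give equal fusion weights.
features = {"common": [1.0, 0.0], "sub_shared": [0.0, 1.0], "private": [1.0, 1.0]}
scores = {"common": 0.0, "sub_shared": 0.0, "private": 0.0}
fused, weights = weighted_fusion(features, scores)
```

With three modalities the layout contains one common, three sub-shared, and three private subspaces. Raising one entry in `scores` shifts weight toward that subspace, which is the kind of dynamic weighting the summary refers to.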

Abstract

Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods focus either on globally shared representations or on modality-specific features, overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, submodally-shared subspaces modeling pairwise cross-modal synergies, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.9% ACC-7 on CMU-MOSEI, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.
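The two metrics quoted above can be sketched as follows. This assumes the convention commonly used for CMU-MOSI and CMU-MOSEI, where sentiment scores lie in [-3, 3] and ACC-7 is the accuracy after clipping and rounding both predictions and labels to the seven integer classes. The paper's exact evaluation script is not shown here, so treat the helper names (`mae`, `to_class7`, `acc7`) and the toy values as illustrative.

```python
def mae(preds, labels):
    """Mean absolute error between predicted and true sentiment scores (lower is better)."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def to_class7(score):
    """Map a real-valued score to one of seven classes in {-3, ..., 3}.

    Assumed convention: clip to [-3, 3], then round to the nearest integer.
    """
    return int(round(max(-3.0, min(3.0, score))))


def acc7(preds, labels):
    """Seven-class accuracy after clipping and rounding both sides."""
    hits = sum(to_class7(p) == to_class7(y) for p, y in zip(preds, labels))
    return hits / len(labels)


# Toy values for illustration only; these are not results from the paper.
preds = [2.4, -0.6, 0.2, 3.7]
labels = [2.0, -1.0, 1.0, 3.0]

mae_value = mae(preds, labels)
acc7_value = acc7(preds, labels)
```

On the toy values, three of the four predictions round to the correct class, and the out-of-range prediction 3.7 is clipped to class 3 before comparison.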

Paper Structure (20 sections, 22 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: An example of a submodally shared cue: the utterance "that's really great" is delivered with a sarcastic tone and a disdainful facial expression. While the lexical content suggests positive sentiment, the conflicting acoustic and visual cues jointly convey negative affect (sarcasm).
  • Figure 2: Overview of the proposed TSD framework. Given multimodal inputs, TSD disentangles features into three complementary subspaces and fuses them via a Subspace-Aware Cross-Attention (SACA) module.
  • Figure 3: Qualitative examples illustrating that incorporating the submodally shared subspace enables TSD to better capture cross-modal cues (e.g., sarcasm), producing sentiment predictions closer to the ground truth.
  • Figure 4: t-SNE visualization of feature distributions on CMU-MOSI. Colors indicate sentiment polarity from negative (dark) to positive (yellow). The full TSD model exhibits a clearer sentiment gradient than the variant without SACA and sub-shared subspaces.
  • Figure 5: (a) Average fusion weights of each subspace (Common, Private, Sub-Shared) in TSD on CMU-MOSI and CMU-MOSEI. (b) Estimated contribution of each subspace during fusion. Higher values and darker colors denote greater subspace significance.
  • ...and 2 more figures