FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation

Yadong Liu, Shangfei Wang

TL;DR

FINE tackles multimodal sentiment analysis by factorizing per-modality features into shared and unique components, then filtering task-irrelevant noise from both using mutual information objectives. It introduces a Mixture of Q-Formers for early fine-grained feature extraction, a Factorized Task-Relevant Encoder for MI-guided disentanglement, and a Dynamic Contrastive Queue for long-range discrimination, with the resulting representations fused via a Transformer. The combined objective includes multimodal and unimodal prediction losses, MI losses, a reconstruction loss, and contrastive terms, and the method achieves state-of-the-art results on the MOSI, MOSEI, UR-FUNNY, and CH-SIMS datasets. The work demonstrates that principled disentanglement of cross-modal information and memory-enabled contrastive learning can robustly handle asynchronous signals and noise in real-world multimodal sentiment tasks.

Abstract

Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information-based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores the latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.
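
The queue idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a fixed-size first-in-first-out memory, discrete sentiment labels, and a supervised InfoNCE-style loss in which queue entries sharing the anchor's label act as positives. The names `ContrastiveQueue`, `queue_contrastive_loss`, `max_size`, and `temperature` are hypothetical.

```python
import math
from collections import deque


def normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ContrastiveQueue:
    """Assumed design: FIFO memory of (feature, label) pairs.

    Once max_size is reached, pushing a new entry evicts the oldest one,
    so the queue always holds the most recent representations.
    """

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def push(self, feature, label):
        self.items.append((normalize(feature), label))


def queue_contrastive_loss(anchor, label, queue, temperature=0.1):
    """Supervised InfoNCE-style loss of one anchor against the queue.

    Entries with the same label are positives; all entries form the
    denominator. Returns 0.0 when the queue holds no positive.
    """
    anchor = normalize(anchor)
    logits = [dot(anchor, feat) / temperature for feat, _ in queue.items]
    positives = [z for z, (_, y) in zip(logits, queue.items) if y == label]
    if not positives:
        return 0.0
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(z - log_denominator for z in positives) / len(positives)


if __name__ == "__main__":
    queue = ContrastiveQueue(max_size=3)
    queue.push([1.0, 0.0], "positive")
    queue.push([0.0, 1.0], "negative")
    print(queue_contrastive_loss([0.9, 0.1], "positive", queue))
    print(queue_contrastive_loss([0.1, 0.9], "positive", queue))
```

In this sketch, an anchor that lies close to stored features of its own class receives a lower loss than one that lies close to the other class. That is the sense in which a memory of past representations can improve class-level separability beyond a single batch.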

Paper Structure

This paper contains 21 sections, 26 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: A sample of MSA, incorporating three modalities: visual, textual, and audio. The bottom-left section displays fine-grained sentiment analysis, while the bottom-right section shows the label and annotation for this example.
  • Figure 2: The overview of FINE. $\mathrm{Pred}_{U}$ represents the unimodal prediction, while $\mathrm{Pred}_{M}$ corresponds to the multimodal prediction.
  • Figure 3: A Venn diagram illustrating mutual information among modalities $\mathcal{X}_1$, $\mathcal{X}_2$, $\mathcal{X}_3$ and the task $\mathcal{Y}$. The blue region denotes task-relevant information, with the blue grid marking the ideal target for learning. This target combines task-relevant shared information (red) and unique information (green). Achieving it requires suppressing task-irrelevant noise from both components.
  • Figure 4: The structure of FTRE. The bottom-left corner depicts the unimodal encoders, where each modality $X_i$ is processed independently. The top-left corner illustrates the direct encoding of labels, resulting in one shared and three unique label embeddings. The top-right corner represents the estimation of four distinct types of mutual information.
  • Figure 5: Visualization of features obtained after FTRE.
  • ...and 3 more figures