FINE: Factorized multimodal sentiment analysis via mutual INformation Estimation
Yadong Liu, Shangfei Wang
TL;DR
FINE tackles multimodal sentiment analysis by factorizing each modality's features into shared and unique components and then filtering task-irrelevant noise from both using mutual information objectives. It introduces a Mixture of Q-Formers for early fine-grained feature extraction, a Factorized Task-Relevant Encoder for MI-guided disentanglement, and a Dynamic Contrastive Queue for long-range discrimination, with the resulting representations fused by a Transformer. The combined objective includes multimodal and unimodal prediction losses, MI losses, a reconstruction loss, and contrastive terms, and the method achieves state-of-the-art results on the MOSI, MOSEI, UR-FUNNY, and CH-SIMS datasets. The work demonstrates that principled disentanglement of cross-modal information and memory-enabled contrastive learning can robustly handle asynchronous signals and noise in real-world multimodal sentiment tasks.
Abstract
Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information-based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules. The first is a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities. The second is a Dynamic Contrastive Queue, placed after factorization, which stores the most recent high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.
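The idea behind the Dynamic Contrastive Queue described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name ContrastiveQueue, the function queue_contrastive_loss, the temperature value, the use of discrete sentiment class labels, and the supervised InfoNCE-style loss are all assumptions made here for illustration. The sketch only shows how a fixed-size memory of recent representations can supply positives and negatives beyond the current batch.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class ContrastiveQueue:
    """Hypothetical fixed-size FIFO memory of (feature, label) pairs."""

    def __init__(self, max_size):
        # deque with maxlen silently discards the oldest entry when full
        self.items = deque(maxlen=max_size)

    def enqueue(self, features, labels):
        for f, y in zip(features, labels):
            self.items.append((normalize(f), y))


def queue_contrastive_loss(anchor, label, queue, temperature=0.1):
    """Assumed supervised InfoNCE-style loss of one anchor against the queue.

    Entries sharing the anchor's label are positives; all others are
    negatives. Returns 0.0 if the queue holds no positive for this label.
    """
    a = normalize(anchor)
    sims = [
        (sum(p * q for p, q in zip(a, f)) / temperature, y)
        for f, y in queue.items
    ]
    positives = [s for s, y in sims if y == label]
    if not positives:
        return 0.0
    # log-sum-exp over every queue entry, shifted by the max for stability
    m = max(s for s, _ in sims)
    log_denom = m + math.log(sum(math.exp(s - m) for s, _ in sims))
    return -sum(s - log_denom for s in positives) / len(positives)
```

In a training loop, one would typically compute the loss for each sample in the current batch against the queue and then enqueue the batch's representations, so that later batches are contrasted with earlier ones. Whether FINE updates its queue in this order is not stated in the abstract.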
