DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu
TL;DR
The paper tackles multimodal sentiment analysis by addressing redundancy and conflicts across the language, vision, and audio modalities. It introduces Disentangled-Language-Focused (DLF) learning, which uses a feature disentanglement module (FDM) to separate modality-shared from modality-specific features under four geometry-based regularizers, and enhances language representations with a Language-Focused Attractor (LFA) built on language-guided cross-attention. A hierarchical prediction (HP) scheme combines pre-fused and post-fused features to boost accuracy. The overall objective sums a decoupling loss $L_d$ and an MSA loss $L_{MSA}$, i.e., $L_{DLF}=L_d+L_{MSA}$. Results on CMU-MOSI and CMU-MOSEI show state-of-the-art performance, and ablation studies validate the contributions of FDM, LFA, and HP. The work advances practical multimodal sentiment understanding by prioritizing language-centric enhancement while maintaining effective cross-modal integration.
Abstract
Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions. Our code is available at https://github.com/pwang322/DLF.
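To make the idea of language-guided cross-attention concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the language features act as queries while another modality supplies keys and values, and that the attended result is added back to the language features as a residual. The function names (`cross_attention`, `language_focused_attractor`) are hypothetical, and the learned projection matrices that a real model would use are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (simplification).

    queries: language tokens, shape [T_l][d]
    keys, values: tokens of another modality, shapes [T_m][d] and [T_m][d_v]
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this language token to every token of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the other modality's values.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def language_focused_attractor(language, others):
    """Hypothetical helper: add attended features from each non-language
    modality to the language features (residual sum is an assumption)."""
    enhanced = [row[:] for row in language]
    for feats in others:
        attended = cross_attention(language, feats, feats)
        enhanced = [[e + a for e, a in zip(e_row, a_row)]
                    for e_row, a_row in zip(enhanced, attended)]
    return enhanced


# Toy usage: 2 language tokens, 3 vision tokens, 4 audio tokens, feature size 2.
language = [[0.1, 0.2], [0.3, 0.4]]
vision = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5], [0.8, 0.7]]
enhanced_language = language_focused_attractor(language, [vision, audio])
print(enhanced_language)
```

Because only the language stream issues queries, the output keeps the language sequence length, which reflects the language-focused design described above: the other modalities contribute complementary information without being treated as equal fusion targets.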
