DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis

Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu

TL;DR

The paper tackles multimodal sentiment analysis by addressing redundancy and conflicts across the language, vision, and audio modalities. It introduces Disentangled-Language-Focused (DLF) learning, which separates modality-shared and modality-specific features using four geometry-based regularizers and enhances language representations with a Language-Focused Attractor (LFA) built on language-guided cross-attention. A hierarchical prediction scheme combines pre-fused and post-fused features to improve accuracy. The overall objective sums a decoupling loss $L_d$ and an MSA loss $L_{MSA}$, i.e., $L_{DLF} = L_d + L_{MSA}$. Empirical results on MOSI and MOSEI show state-of-the-art performance, and ablations validate the contributions of the feature disentanglement module (FDM), the LFA, and hierarchical predictions (HP). The work advances practical multimodal sentiment understanding by prioritizing language-centric enhancement while maintaining effective cross-modal integration.
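To make the summed objective $L_{DLF} = L_d + L_{MSA}$ concrete, here is a minimal sketch. It is not the authors' implementation: the summary does not specify the four geometric regularizers or the exact task loss, so `decoupling_loss` uses a single hypothetical cosine-based penalty as a stand-in for $L_d$, and `msa_loss` assumes a plain absolute-error regression loss. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def decoupling_loss(shared, specific):
    # Hypothetical stand-in for L_d: penalize alignment between the
    # modality-shared and modality-specific features so they stay distinct.
    return abs(cosine(shared, specific))


def msa_loss(pred, label):
    # Assumed task loss L_MSA: absolute error on a scalar sentiment score.
    return abs(pred - label)


def dlf_loss(shared, specific, pred, label):
    # L_DLF = L_d + L_MSA, as stated in the summary above.
    return decoupling_loss(shared, specific) + msa_loss(pred, label)
```

With orthogonal shared and specific features the decoupling term vanishes and only the prediction error remains; with identical features the stand-in penalty reaches its maximum of 1.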

Abstract

Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions. Our code is available at https://github.com/pwang322/DLF.

Paper Structure

This paper contains 17 sections, 15 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Task pipeline of the Multimodal Sentiment Analysis, and varied performance of different modalities.
  • Figure 2: Overview of the proposed DLF framework. The framework follows a pipeline of feature extraction, disentanglement, enhancement, fusion, and prediction, featuring three core components: the feature disentanglement module, the Language-Focused Attractor (LFA), and hierarchical predictions (including shared prediction, specific prediction, and final prediction).
  • Figure 3: The details of the proposed LFA. The language-focused cross-attention and self-attention achieve targeted feature enhancement: $V \rightarrow L$, $A \rightarrow L$, and $L \rightarrow L$.
  • Figure 4: Left: Confusion matrix on MOSI. Right: Corresponding accuracy for each sentiment class. HN: Highly Negative; N: Negative; WN: Weakly Negative; NT: Neutral; WP: Weakly Positive; P: Positive; HP: Highly Positive.
  • Figure 5: Visualization of the fused multimodal representations. HN: Highly Negative; N: Negative; WN: Weakly Negative; NT: Neutral; WP: Weakly Positive; P: Positive; HP: Highly Positive.
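
The attention pattern named in the Figure 3 caption ($V \rightarrow L$, $A \rightarrow L$, $L \rightarrow L$) can be sketched with plain scaled dot-product attention in which language features always supply the queries. This is an illustrative sketch, not the paper's LFA: it uses a single head with no learned projections, and summing the three attended streams is an assumption, since the summary does not say how they are combined. The names `cross_attention` and `language_focused_enhance` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows, one output row per query.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def language_focused_enhance(lang, vision, audio):
    # Language provides the queries in all three streams.
    v_to_l = cross_attention(lang, vision, vision)  # V -> L
    a_to_l = cross_attention(lang, audio, audio)    # A -> L
    l_to_l = cross_attention(lang, lang, lang)      # L -> L (self-attention)
    # Assumption: combine the streams by element-wise summation.
    return [[a + b + c for a, b, c in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(v_to_l, a_to_l, l_to_l)]
```

Because language supplies every query, the output always has one row per language token, regardless of how many vision or audio frames are attended over.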