PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis

Heng Xie, Kang Zhu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Ruibo Fu, Changsheng Li

TL;DR

PSA-MF introduces personality-sentiment alignment and a multi-level fusion framework for multimodal sentiment analysis. It extracts personalized sentiment features from text by aligning them with personality signals, then progressively fuses textual, visual, and audio cues through three stages: pre-fusion, cross-modal interaction, and enhanced fusion. The method achieves state-of-the-art results on MOSI and MOSEI. Ablation studies confirm the critical role of the personality features and the alignment losses, and a layer-wise analysis shows that the depth at which alignment is applied matters, since it trades off semantic richness against modality noise. The contribution is a concrete architecture and loss-function design aimed at improving how sentiment information is transferred and interpreted across modalities.

Abstract

Multimodal sentiment analysis (MSA) is a research field that recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information from different modalities, a difficulty that typically arises in both the unimodal feature extraction phase and the multimodal feature fusion phase. During the extraction phase, existing methods extract only shallow information from unimodal features, neglecting differences in sentiment across personalities. During the fusion phase, they directly merge the feature information from each modality without considering differences at the feature level. This ultimately degrades the model's recognition performance. To address this problem, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during the feature extraction phase and propose a novel personality-sentiment alignment method that, for the first time, obtains personalized sentiment embeddings from the textual modality. In the fusion phase, we introduce a novel multi-level fusion method that gradually integrates sentiment information from the textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Our method has been evaluated through multiple experiments on two commonly used datasets, achieving state-of-the-art results.
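
To make the idea of personality-sentiment alignment more concrete, the sketch below shows one common way to align two sets of embeddings with a contrastive objective. It is an illustration only: the abstract does not give the form of the loss, so the symmetric InfoNCE formulation, the cosine similarity, the temperature value, and the function names (cosine, info_nce, symmetric_alignment_loss) are all assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] is treated as matching positives[i]; every other row of
    `positives` in the batch acts as a negative for anchor i.
    (Assumed formulation, not taken from the paper.)
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n


def symmetric_alignment_loss(personality, sentiment, temperature=0.1):
    """Average of the personality->sentiment and sentiment->personality losses."""
    return 0.5 * (
        info_nce(personality, sentiment, temperature)
        + info_nce(sentiment, personality, temperature)
    )


# Toy batch: two samples with 2-dimensional embeddings (hypothetical values).
personality = [[1.0, 0.0], [0.0, 1.0]]
sentiment = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = symmetric_alignment_loss(personality, sentiment)
mismatched_loss = symmetric_alignment_loss(personality, sentiment[::-1])
```

In this toy batch the loss is small when each personality embedding is paired with its own sentiment embedding and large when the pairs are swapped, which is the behavior a contrastive alignment term is meant to reward.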

Paper Structure

This paper contains 15 sections, 15 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The proposed personality-sentiment aligned multi-level fusion model (PSA-MF). The top-left panel shows the main framework of the model, which includes feature extraction and cross-attention modal interaction. (a) shows the personality-sentiment alignment module, which includes personality-sentiment contrastive learning and personalized sentimental constraints. (b) shows the multimodal pre-fusion module, which uses the deep layers of BERT as a multimodal encoder for initial alignment across the three modalities. (c) shows the enhanced fusion module, which performs serial and parallel fusion of the multi-level features from the upper-layer cross-attention and produces the final prediction.
  • Figure 2: Acc2 and F1 for personality-sentiment alignment applied at different layers on the CMU-MOSI dataset.
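
The Figure 1 caption mentions cross-attention modal interaction. As a rough illustration of that operation, here is a minimal scaled dot-product cross-attention in plain Python, in which features of one modality act as queries over the features of another. The learned query, key, and value projections of a real attention layer are omitted, and the choice of text as the query modality, the function names, and the toy values are assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    queries: feature vectors of the attending modality (assumed here: text)
    keys, values: feature vectors of the attended modality (e.g. audio),
                  one key and one value per time step
    Returns one output vector per query, each a weighted average of `values`.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


# Toy example: one text token attending over two audio frames (hypothetical values).
text_queries = [[10.0, 0.0]]
audio_keys = [[1.0, 0.0], [0.0, 1.0]]
audio_values = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attention(text_queries, audio_keys, audio_values)
```

Because the query points in the same direction as the first key, nearly all of the attention weight falls on the first audio frame, so the output is close to that frame's value vector.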