Dynamic Multimodal Sentiment Analysis: Leveraging Cross-Modal Attention for Enabled Classification

Hui Lee; Singh Suniljit; Yong Siang Ong

TL;DR

The paper tackles multimodal sentiment analysis by integrating text, audio, and visual data to capture cross-modal interactions. It compares three fusion strategies (late-stage fusion, early-stage fusion, and multi-headed attention) within a transformer-based ModalityTransformer framework on the CMU-MOSEI dataset. Late-stage fusion reaches 66.23% accuracy, early-stage fusion 71.87%, and multi-headed attention 72.39%. Early fusion therefore improves on late fusion by 5.64 percentage points, while attention adds only 0.52 points over early fusion in this setup. The findings highlight the value of early multimodal integration and point to future work on dynamic temporal fusion and adaptive feature weighting.

Abstract

This paper explores the development of a multimodal sentiment analysis model that integrates text, audio, and visual data to enhance sentiment classification. The goal is to improve emotion detection by capturing the complex interactions between these modalities, thereby enabling more accurate and nuanced sentiment interpretation. The study evaluates three feature fusion strategies (late-stage fusion, early-stage fusion, and multi-headed attention) within a transformer-based architecture. Experiments were conducted on the CMU-MOSEI dataset, which includes synchronized text, audio, and visual inputs labeled with sentiment scores. Results show that early-stage fusion significantly outperforms late-stage fusion, achieving an accuracy of 71.87%, while the multi-headed attention approach offers a marginal further improvement, reaching 72.39%. The findings suggest that integrating modalities early in the process enhances sentiment classification, while attention mechanisms may have limited impact within the current framework. Future work will focus on refining feature fusion techniques, incorporating temporal data, and exploring dynamic feature weighting to further improve model performance.
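
To make the difference between the three fusion strategies concrete, the following is a minimal sketch in plain Python, not the paper's implementation. Several things here are assumptions made for illustration: the two-dimensional toy feature vectors, the two-class output, the hand-picked (not learned) weights, and all function names. Late fusion is shown as averaging per-modality class probabilities, which is one common form and may differ from the paper's. The attention variant is a single-head scaled dot-product attention without learned projections, standing in for the multi-headed attention the paper evaluates.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def late_fusion(features, heads):
    """Classify each modality on its own, then average class probabilities."""
    probs = [softmax(linear(x, w, b)) for x, (w, b) in zip(features, heads)]
    return [sum(p[c] for p in probs) / len(probs)
            for c in range(len(probs[0]))]

def early_fusion(features, head):
    """Concatenate modality features, then classify the joint vector once."""
    joint = [v for x in features for v in x]
    w, b = head
    return softmax(linear(joint, w, b))

def attention_fusion(features, head):
    """Let each modality attend over all modalities, mean-pool, classify.

    Simplification (assumption): one head, no learned Q/K/V projections.
    """
    d = len(features[0])
    attended = []
    for q in features:
        scores = [sum(a * k_i for a, k_i in zip(q, k)) / math.sqrt(d)
                  for k in features]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, features))
                         for i in range(d)])
    pooled = [sum(v[i] for v in attended) / len(attended) for i in range(d)]
    w, b = head
    return softmax(linear(pooled, w, b))

# Toy inputs (hypothetical): one 2-d feature vector per modality.
text, audio, visual = [2.0, 0.0], [0.5, 0.5], [0.0, 1.0]
features = [text, audio, visual]

# Hand-picked weights for a 2-class output; a real model would learn these.
per_modality_heads = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])] * 3
joint_head = ([[1.0, -1.0] * 3, [-1.0, 1.0] * 3], [0.0, 0.0])
pooled_head = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])

p_late = late_fusion(features, per_modality_heads)
p_early = early_fusion(features, joint_head)
p_attn = attention_fusion(features, pooled_head)
print(p_late, p_early, p_attn)
```

The structural point is where the modalities meet: late fusion combines only the final per-modality predictions, early fusion gives a single classifier access to all features at once, and attention reweights each modality's representation by its similarity to the others before classification.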

Paper Structure (20 sections, 5 figures, 2 tables)

Introduction
Approach
Related Work
Data Collection
Preprocessing in CMU MOSEI dataset
Modality Transformer Architecture
Feature Fusion Approach
Approach 0: Late Stage Feature Fusion
Approach 1: Early Stage Feature Fusion
Approach 2: Multi-headed Attention
Experiments and Results
Experimental Setup
Results
Model for Each Modality
Approach 0: Late Stage Feature Fusion Results
...and 5 more sections

Figures (5)

Figure 1: Training and validation loss and accuracy for video model
Figure 2: Training and validation loss and accuracy for text model
Figure 3: Training and validation loss and accuracy for audio model
Figure 4: Training and validation loss and accuracy for early stage feature fusion model
Figure 5: Training and validation loss and accuracy for multi-headed attention model
