Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach

Taoxu Zhao, Meisi Li, Kehao Chen, Liye Wang, Xucheng Zhou, Kunal Chaturvedi, Mukesh Prasad, Ali Anaissi, Ali Braytee

TL;DR

The paper tackles multimodal sentiment analysis by fusing text and image information, using a BERT-based textual encoder and a DINOv2-based visual encoder and projecting both modalities to 256-dimensional latent representations. It introduces three fusion schemes (Basic Fusion, Self-Attention Fusion, and Dual-Attention Fusion) to integrate cross-modal cues and improve sentiment prediction. Empirical results on the Memotion 7k and MVSA datasets show competitive performance: Dual-Attention Fusion achieves a macro F1 of 0.3552 on Memotion 7k and an accuracy of about 0.73 on MVSA-single, while performance on MVSA-multi remains below the state of the art. The work demonstrates the effectiveness of attention-based fusion for text-image sentiment analysis and suggests avenues for extending multimodal analysis to additional modalities and more advanced fusion techniques.

Abstract

Multimodal sentiment analysis enhances conventional sentiment analysis, which traditionally relies solely on text, by incorporating information from different modalities such as images, text, and audio. This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments. For text feature extraction, we utilize BERT, a natural language processing model. For image feature extraction, we employ DINOv2, a vision-transformer-based model. The textual and visual latent features are integrated using three proposed fusion techniques: the Basic Fusion Model, the Self-Attention Fusion Model, and the Dual-Attention Fusion Model. Experiments on three datasets (Memotion 7k, MVSA-single, and MVSA-multi) demonstrate the viability and practicality of the proposed multimodal architecture.
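
To make the overall data flow concrete, the following is a minimal sketch of the simplest kind of fusion: each modality's feature vector is projected to a 256-dimensional latent vector, the two are combined, and the result is classified. It is not the authors' implementation. The summary does not specify how the fusion models combine the latents, so concatenation is an assumption here, as are the encoder feature sizes (768 for BERT-base and for DINOv2 ViT-B/14), the three-class output, the random weights, and all function names. Random arrays stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_DIM = 768    # assumption: BERT-base hidden size
IMAGE_DIM = 768   # assumption: DINOv2 ViT-B/14 embedding size
LATENT_DIM = 256  # from the summary: both modalities are projected to 256 dims
NUM_CLASSES = 3   # assumption: negative / neutral / positive


def linear(in_dim, out_dim):
    """Return randomly initialised (weights, bias) for a linear layer."""
    weights = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
    bias = np.zeros(out_dim)
    return weights, bias


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical, untrained parameters for the two projections and the classifier.
text_proj = linear(TEXT_DIM, LATENT_DIM)
image_proj = linear(IMAGE_DIM, LATENT_DIM)
classifier = linear(2 * LATENT_DIM, NUM_CLASSES)


def fuse_and_classify(text_feat, image_feat):
    """Project each modality to the latent size, concatenate, and classify."""
    text_latent = text_feat @ text_proj[0] + text_proj[1]
    image_latent = image_feat @ image_proj[0] + image_proj[1]
    # Assumed fusion step: plain concatenation of the two latent vectors.
    fused = np.concatenate([text_latent, image_latent], axis=-1)
    logits = fused @ classifier[0] + classifier[1]
    return softmax(logits)


# Random stand-ins for a batch of four encoder outputs.
batch_size = 4
text_feat = rng.normal(size=(batch_size, TEXT_DIM))
image_feat = rng.normal(size=(batch_size, IMAGE_DIM))

probs = fuse_and_classify(text_feat, image_feat)
print(probs.shape)  # (4, 3): one probability distribution per sample
```

An attention-based variant would replace the concatenation step with a learned weighting of the latent vectors; the sketch leaves that out because the summary gives no details of the attention design.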

Paper Structure

This paper contains 14 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The overall architecture of the proposed framework (Above). The fusion methodology of the framework (Below): a) Basic Fusion Model; b) Self-Attention Fusion Model; c) Dual-Attention Fusion Model
  • Figure 2: Performance breakdown of the best model on the Memotion 7k dataset
  • Figure 3: Performance breakdown of the best model on the MVSA-multi dataset
  • Figure : Example 1
  • Figure : Example 1
  • ...and 1 more figure