Holistic Visual-Textual Sentiment Analysis with Prior Models

Junyu Chen, Jie An, Hanjia Lyu, Christopher Kanan, Jiebo Luo

TL;DR

The paper tackles visual-textual sentiment analysis, a task made challenging by diverse image domains and embedded textual cues. It introduces VSA-PF, a holistic framework that combines a trainable visual-textual branch, a visual expert branch built on pre-trained priors, and a CLIP branch, and integrates their features with a BERT-based multimodal fusion module. Key contributions include leveraging a rich set of pre-trained priors (Swin Transformer, BERTweet, FaceNet, YOLOv5, Places365, OCR), using CLIP to model cross-modal associations, and a Transformer-based fusion network; the resulting model outperforms prior state-of-the-art methods on MVSA and TumEmo. Ablation studies indicate that each component contributes to performance, with OCR and textual priors playing pivotal roles. The approach shows practical potential for social-media sentiment analysis and multimodal understanding, and it opens avenues for integration with large language or vision models to further improve performance on complex multimodal sentiment tasks.

Abstract

Visual-textual sentiment analysis aims to predict sentiment from an image-text pair, which makes it challenging to learn effective features for diverse input images. To address this, we propose a holistic method that achieves robust visual-textual sentiment analysis by exploiting a rich set of powerful pre-trained visual and textual prior models. The proposed method consists of four parts: (1) a visual-textual branch to learn features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders to extract selected semantic visual features, (3) a CLIP branch to implicitly model visual-textual correspondence, and (4) a multimodal feature fusion network based on BERT to fuse multimodal features and make sentiment predictions. Extensive experiments on three datasets show that our method produces better visual-textual sentiment analysis performance than existing methods.
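
To make the four-part layout in the abstract concrete, here is a toy, dependency-free sketch of how features from several branches might be fused and classified. It is not the authors' implementation. The paper uses a BERT-based fusion network, whereas this sketch swaps in a single layer of single-head scaled dot-product self-attention with identity projections and no learned weights. The branch names, the feature values, the summary token, and the linear classifier weights are all hypothetical placeholders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    A stand-in for the learned, BERT-based fusion network described in the
    abstract; every token attends to every other token.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * value[i] for w, value in zip(weights, tokens))
                         for i in range(dim)])
    return attended

def fuse(branch_features, summary_token):
    """Prepend a summary token to the branch features and read it back out."""
    tokens = [summary_token] + list(branch_features.values())
    return self_attention(tokens)[0]

def classify(fused, class_weights):
    """Pick the label whose (hypothetical) linear weights score highest."""
    logits = {label: sum(w * x for w, x in zip(weights, fused))
              for label, weights in class_weights.items()}
    return max(logits, key=logits.get)

# Hypothetical 2-dimensional features, one vector per branch.
branch_features = {
    "visual_textual": [0.9, 0.1],
    "visual_expert": [0.7, 0.3],
    "clip": [0.8, 0.2],
}
fused = fuse(branch_features, summary_token=[0.5, 0.5])
label = classify(fused, {"positive": [1.0, 0.0], "negative": [0.0, 1.0]})
print(label, fused)
```

In a real system each branch would emit high-dimensional embeddings from its pre-trained encoder, and the fusion weights would be learned end to end.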

Paper Structure (14 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Examples from a visual-textual sentiment dataset, where image and text are used to jointly express the sentiment. Clearly, it is challenging to predict the sentiment only from the image in some cases.
  • Figure 2: Overview of the proposed method’s architecture and training procedure. Our framework (c) consists of four parts: (1) A visual-textual branch to learn visual and textual features for sentiment prediction, (2) A visual expert branch to equip the method with a strong visual prior, (3) A CLIP branch to implicitly model the visual-textual correspondence with aligned embeddings, and (4) A multimodal feature fusion module to integrate all information and make holistic sentiment predictions. Initial training (a) fine-tunes the visual-textual branch separately on unimodal data to capture sentiment features before proceeding to multimodal training (b, c). The dataset splits remain consistent throughout the two training phases.
  • Figure 3: Examples from visual-textual sentiment datasets, where the sentiment is mainly revealed by image text. We give three instances where the OCR engine accurately detected the image text and helped improve the accuracy, along with one failure case in which the OCR engine failed to detect hand-written image text.
  • Figure 4: Examples from visual-textual sentiment datasets. Each column represents a branch and a corresponding data pair, where our VSA-PF model accurately predicts sentiments for these examples, but the removal of the branch results in incorrect prediction for the associated sample.
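
The Figure 2 caption notes that the dataset splits remain consistent across the two training phases. One way this could be arranged is sketched below, assuming a seeded shuffle and an 8:1:1 train/validation/test ratio. The ratio, the function names, and the sample records are illustrative assumptions, not details taken from the paper.

```python
import random

def make_splits(sample_ids, seed=0, train_frac=0.8, val_frac=0.1):
    """Return a deterministic train/val/test split of the given sample ids.

    Sorting first makes the result independent of the input order, and the
    seeded generator makes it reproducible across training phases.
    """
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def unimodal_view(pairs, split_ids, modality):
    """Select one modality ("image" or "text") for the samples in a split."""
    return [pairs[sample_id][modality] for sample_id in split_ids]

# Hypothetical image-text pairs keyed by sample id.
pairs = {i: {"image": f"img_{i}.jpg", "text": f"post {i}"} for i in range(20)}
splits = make_splits(pairs.keys(), seed=0)

# Phase 1: fine-tune the visual-textual branch on unimodal training data.
phase1_images = unimodal_view(pairs, splits["train"], "image")
phase1_texts = unimodal_view(pairs, splits["train"], "text")

# Phase 2: multimodal training reuses exactly the same training ids.
phase2_pairs = [pairs[sample_id] for sample_id in splits["train"]]
```

Because both phases draw from the same `splits` dictionary, no test sample seen during evaluation can leak into either fine-tuning stage.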