Holistic Visual-Textual Sentiment Analysis with Prior Models
Junyu Chen, Jie An, Hanjia Lyu, Christopher Kanan, Jiebo Luo
TL;DR
The paper tackles visual-textual sentiment analysis, a task made challenging by diverse image domains and by textual cues embedded in images. It introduces VSA-PF, a holistic framework that combines a trainable visual-textual branch, a visual-expert prior branch, a CLIP-based branch, and a BERT-driven multimodal fusion module. Key contributions include leveraging a rich set of pre-trained priors (Swin Transformer, BERTweet, FaceNet, YOLOv5, Places365, OCR), using CLIP to model cross-modal associations, and a Transformer-based fusion module; together these outperform prior state-of-the-art methods on MVSA and TumEmo. Ablation studies indicate that each component contributes to performance, with the OCR and textual priors playing pivotal roles. The approach shows practical potential for social-media sentiment analysis and multimodal understanding, and it opens avenues for integration with large language or vision models to further improve performance on complex multimodal sentiment tasks.
Abstract
Visual-textual sentiment analysis aims to predict sentiment from an image-text pair, which makes it challenging to learn effective features for diverse input images. To address this, we propose a holistic method that achieves robust visual-textual sentiment analysis by exploiting a rich set of powerful pre-trained visual and textual prior models. The proposed method consists of four parts: (1) a visual-textual branch that learns features directly from data for sentiment analysis, (2) a visual expert branch with a set of pre-trained "expert" encoders that extract selected semantic visual features, (3) a CLIP branch that implicitly models visual-textual correspondence, and (4) a BERT-based multimodal feature fusion network that fuses the multimodal features and makes sentiment predictions. Extensive experiments on three datasets show that our method achieves better visual-textual sentiment analysis performance than existing methods.
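The four-part design can be illustrated with a toy sketch of the data flow: each branch emits a few feature tokens, the tokens are concatenated into one sequence, and an attention-based fusion step produces a sentiment distribution. This is not the authors' implementation. Random vectors stand in for the real encoder outputs, a single self-attention layer with identity projections stands in for the BERT-based fusion network, and the token counts, the feature width `D`, the three-class output, and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Identity query/key/value projections are used (an assumption); a real
    fusion network would learn these and stack several layers.
    """
    d = len(tokens[0])
    fused = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        fused.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return fused


def fuse_and_classify(branch_features, class_weights):
    """Concatenate tokens from all branches, fuse them, and classify."""
    tokens = [tok for feats in branch_features.values() for tok in feats]
    fused = self_attention(tokens)
    d = len(fused[0])
    # Mean-pool the fused tokens into one vector (an assumption).
    pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(d)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
    return softmax(logits)


rng = random.Random(0)
D = 8  # hypothetical shared feature width


def rand_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(D)]


# Placeholder outputs for the three feature-producing branches.
features = {
    "visual_textual": [rand_vec(), rand_vec()],        # trainable image and text encoders
    "visual_experts": [rand_vec() for _ in range(4)],  # e.g. face, object, scene, OCR
    "clip": [rand_vec(), rand_vec()],                  # CLIP image and text embeddings
}
class_weights = [rand_vec() for _ in range(3)]  # assumed 3 sentiment classes

probs = fuse_and_classify(features, class_weights)
print(probs)
```

Treating every branch output as a token in one sequence lets the fusion step attend across modalities and priors uniformly, which is the general idea behind a Transformer-style fusion module.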
