SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis

Bin Feng, Shulan Ruan, Mingzheng Yang, Dongxuan Han, Huijie Liu, Kai Zhang, Qi Liu

TL;DR

SentiFormer tackles image sentiment analysis by integrating image data with rich metadata (captions, object tags, and scene tags) through prompt learning and CLIP-based unified embeddings. It introduces adaptive relevance learning to weight metadata contributions and a cross-modal transformer to fuse refined image and metadata representations for sentiment prediction. The approach achieves state-of-the-art results on FI and Twitter_LDL and demonstrates zero-shot capability on Artphoto, with ablation analyses confirming the necessity of each component. Additionally, metadata-enhanced datasets are released to spur further research in multimodal sentiment analysis.

Abstract

As more internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recent work has largely focused on designing neural networks that extract visual features from images for sentiment analysis. Despite significant progress, metadata, the data that describes an image (e.g., text descriptions and keyword tags), has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) that fuses multiple types of metadata and the corresponding image in a unified framework. Specifically, we first obtain multiple types of metadata for the image and unify the representations of these diverse data. To adaptively learn appropriate weights for each type of metadata, we then design an adaptive relevance learning module that highlights more effective information while suppressing weaker information. Moreover, we develop a cross-modal fusion module to fuse the adaptively learned representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.
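
To make the idea of adaptive relevance weighting concrete, the following is a minimal illustrative sketch, not the paper's formulation. It assumes that each metadata type and the image have already been embedded into a shared vector space, scores each metadata vector by cosine similarity to the image vector, and converts the scores to weights with a softmax. The function names, the toy vectors, and the choice of cosine similarity with softmax are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_weights(image_vec, metadata_vecs):
    """Hypothetical relevance step: one weight per metadata type,
    higher when the metadata embedding is closer to the image embedding."""
    names = list(metadata_vecs)
    sims = [cosine(image_vec, metadata_vecs[n]) for n in names]
    return dict(zip(names, softmax(sims)))


def reweight(metadata_vecs, weights):
    """Scale each metadata embedding by its relevance weight."""
    return {n: [weights[n] * x for x in vec] for n, vec in metadata_vecs.items()}


# Toy embeddings (assumed values, not taken from the paper).
image_vec = [1.0, 0.0, 0.0]
metadata_vecs = {
    "caption": [0.9, 0.1, 0.0],
    "object_tags": [0.5, 0.5, 0.0],
    "scene_tags": [0.0, 0.0, 1.0],
}

weights = relevance_weights(image_vec, metadata_vecs)
refined = reweight(metadata_vecs, weights)
```

In this toy setup the caption embedding is closest to the image embedding, so it receives the largest weight, while the orthogonal scene-tag embedding receives the smallest. A fusion stage would then consume the reweighted vectors together with the image features.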

Paper Structure

This paper contains 14 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The overall architecture of SentiFormer.
  • Figure 2: Visualization of feature distribution on eight categories before (left) and after (right) model processing.
  • Figure 3: Sensitivity study of SentiFormer at different depths.